CANN/asc-devkit HCCL Algorithm Analyzer Guide
Algorithm Analyzer User Guide

asc-devkit is the operator programming language that CANN provides for Ascend AI processors. It natively supports the C and C++ standards, consists mainly of a class library and a language extension layer, and offers multi-level APIs for operator development across a range of scenarios. Project address: https://gitcode.com/cann/asc-devkit

Tool Introduction

The HCCL algorithm analyzer simulates HCCL algorithm execution in an offline environment. It verifies algorithm logic and memory operations, and runs test tasks efficiently.

Principle Introduction

Key Points:

- The algorithm analyzer stubs the dependencies (hcomm and runtime interfaces) of the HCCL single-operator execution flow. During algorithm execution, it captures the Task sequences from all ranks.
- It organizes the Task information from all ranks into a directed acyclic graph.
- It runs validations based on graph algorithms, such as memory read-write conflict validation and semantic validation.
- Memory conflict validation uses the synchronization relationships in the graph to determine whether potential read-write conflicts exist.
- Semantic validation simulates execution of the Task graph and records data transfer information. After the simulation completes, it checks whether the data transfer information in UserOutput memory meets the operator's requirements.

Environment Preparation

Follow the environment preparation, source code download, compilation, and installation steps in Source Code Build to prepare for compiling the algorithm analyzer.

Test Case Writing

LLT Test Case Overview

An algorithm checker test case consists of 5 steps: simulation model initialization, operator parameter settings, operator execution, result graph validation, and resource cleanup. The following sections describe how to write each step to accommodate different operator requirements. The final section explains how to use the checker tool for issue diagnosis.

LLT Test Case Step Details

Simulation Model Initialization

TopoMeta Structure Introduction

The checker uses TopoMeta to represent a topology.
TopoMeta is a three-layer vector structure:

- PhyDeviceId represents the physical ID of an NPU.
- ServerMeta consists of PhyDeviceIds and represents the number of cards in a server and their corresponding PhyDeviceIds.
- SuperPodMeta consists of ServerMetas and represents the servers that form a super node.
- TopoMeta represents the overall topology of the cluster.

TopoMeta Generation Methods

There are two ways to generate TopoMeta:

- Specify the number of super nodes, servers, and cards per server, then use the provided GenTopoMeta function to generate it. This applies to symmetric topology scenarios.
- Fully customize the super nodes, servers, and card counts. This applies to both symmetric and asymmetric topology scenarios.

Model Initialization

Pass in the generated TopoMeta and specify the device type to simulate.

Operator Parameter Settings

Operator Execution Parameters

Using Scatter as an example, you need to set the input parameters used to execute and validate the HcclScatter operator:

- root: The root node. The Scatter operation distributes data from the root node of the communication domain evenly to the other Ranks.
- rankSize: The number of Ranks participating in collective communication in this communication domain (must be consistent with the number of cards in topoMeta).
- recvCount: The amount of data each Rank receives from the root node.
- dataType: The data type corresponding to recvCount.

For other operators or custom operator scenarios, set the parameters according to the operator's requirements.

Set Environment Variables

Environment variables affect conditional logic in the code.
Use the setenv function to set the required values before the test case executes.

Important Notes

- Supported operators: Currently only the Scatter operator is supported.
- Supported modes: Currently only OPBASE single-operator mode is supported.
- Supported device types: Currently only DEV_TYPE_910B and DEV_TYPE_91093 (which represents DEV_TYPE_910C) are supported.

Operator Execution Flow

Run the single-operator flow in a multi-threaded manner, using the following steps.

1. Construct operator input parameters. Construct the parameters required for single-operator execution, including:
   - SetDevice: Binds a thread to a Rank so that each thread simulates the corresponding Rank.
   - Main stream resource creation: Call the aclrtCreateStream interface, whose stub implementation simulates stream resource creation.
   - Communication domain initialization: Call HcclCommInitClusterInfo, whose stub implementation simulates communication domain creation.
   - Input/output memory allocation: Call aclrtMalloc, whose stub implementation simulates memory creation and marks memory types. You must calculate the required memory size in bytes based on the operator type, data count, and data type.
2. Operator dispatch. Call the HcclScatter operator and pass in the parameters constructed above. For custom operator scenarios, replace this call with the custom operator API and adjust the parameters above to match the custom operator's requirements.
3. Communication domain destruction. Call the HcclCommDestroy interface to destroy the communication domain.

Result Graph Validation

Get the Task queues from all Ranks and call the corresponding operator validation function. For the Scatter operator, call CheckScatter and pass in the Task queues and the parameters required for Scatter operator validation.
The gtest framework prints the test result based on the value returned by the validation function.

Resource Cleanup

The final step of a test case is to clean up the simulation model resources so that they do not interfere with the next test case.

Test Case Filtering and Debugging

When there are many test cases and you only need to execute one, modify the test case name in main.cc.

Test Case Compilation and Execution

Compile and execute the algorithm analyzer test cases:

    # Enter algorithm analyzer directory /hccl/test/st/algorithm
    cd ./hccl/test/st/algorithm
    # Compile test cases and automatically execute
    bash build.sh

Result Example

The test case execution results contain the following fields:

- [run]: Indicates the test case being executed for validation.
- [OK]: Indicates successful execution; validation passed.
- [FAIL]: Indicates execution failure. Analyze the specific reason based on the console logs.

Issue Diagnosis

Memory Conflict Validation Diagnosis Method

Issue Phenomenon

A memory conflict occurs when a memory region between two synchronization signals is written concurrently by multiple tasks, or is written while being read. In actual runtime environments, this typically manifests as randomly occurring precision issues.

Under the current Mesh structure, false positives may occur if a Reduce operator exists. The reason is that under the Mesh structure, a memory block may be written by several other cards simultaneously within one synchronization interval. Hardware ensures the atomicity of Reduce operations, so no precision issues occur in actual runtime. From the checker's perspective, however, multiple read-write operations on the same memory are detected between two synchronizations, so the case is flagged as an error.

Apart from the above scenario, the following error indicates a memory conflict risk in task scheduling:

    [1]there is memory use confilict in two SliceMemoryStatus
    [2]one is startAddr is 0, size is 3200, status is WRITE.
    [3]another is startAddr is 0, size is 3200, status is WRITE.
    [4]failed to check memory BufferType::OUTPUT_CCL
    [5]memory conflict between node [rankId:1, queueId:0, index:1] and node [rankId:2, queueId:0, index:1]
    [6]check rank memory conflict failed for rank 0

Lines 2 and 3 give the start address (startAddr), size, and read/write status (status) of the two conflicting memory blocks.

status has two values: READ and WRITE. READ indicates that the memory block is being read, and WRITE indicates that it is being written. READ and WRITE describe abstract memory access semantics; they do not refer only to read tasks and write tasks.

- Memory blocks that may be in READ status: localcopy task src, read task src, write task src.
- Memory blocks that may be in WRITE status: localcopy task dst, read task dst, write task dst.

Line 4 gives the type of the conflicting memory block.
Line 5 identifies the two tasks that caused the memory conflict.
Line 6 gives the rank on which the memory conflict occurred.

The above error log indicates that two tasks are simultaneously writing to the range 0-3200 of OUTPUT_CCL memory.

Diagnosis Method

Based on the error log, find the two tasks that caused the memory conflict and investigate the synchronization scheduled before and after them. In the log above, these are the tasks at node [rankId:1, queueId:0, index:1] and node [rankId:2, queueId:0, index:1].

Semantic Validation Failure Diagnosis Method

Semantic Validation Basic Concepts

The algorithm analyzer uses relative addresses to represent memory. An address is composed of three fields (memory type, offset address, and size) and is represented by the DataSlice class:

    class DataSlice {
    public:
        // Some method functions
    private:
        BufferType type;  // Memory type
        u64 offset;       // Offset address within that memory type
        u64 size;         // Size of the slice
    };

Supported memory types include Input, Output, and CCL.

Collective communication algorithms involve complex data transfer and reduction operations during execution.
The algorithm analyzer uses BufferSemantic to record data transfer relationships. A BufferSemantic includes one destination memory expression and multiple source memory expressions. The destination memory is represented by the member variables startAddr and size. Each source memory is represented by a SrcBufDes struct. The definitions are as follows:

    struct BufferSemantic {
        u64 startAddr;
        mutable u64 size;                     // Size; source and destination memory share the same size
        mutable bool isReduce;                // Whether reduction is performed; true when srcBufs has multiple entries
        mutable HcclReduceOp reduceType;      // Type of reduction operation
        mutable std::set<SrcBufDes> srcBufs;  // Which rank(s) this data comes from
    };

    struct SrcBufDes {
        RankId rankId;        // Source rankId
        BufferType bufType;   // Source memory type
        mutable u64 srcAddr;  // Offset address relative to source memory type
    };

Semantic Calculation Example

The following example explains what semantic calculation is.

Initial state: There are two Ranks, Rank0 and Rank1, each with two memory types, Input and Output.

State one action: Transfer the data block from rank0's Input with offset address 20 and size 30 to rank0's Output with offset address 35.
Result: A semantic block is generated on rank0's Output, recording this transfer information.

State two action: Transfer the data block from rank1's Input with offset address 70 and size 15 to rank0's Output with offset address 50.
Result: The destination memory overlaps with the existing semantic block, so the existing block must be split, generating two semantic blocks. The existing block covers Output bytes 35 to 65 and the new transfer covers bytes 50 to 65, so the first resulting block (offset 35, size 15) still comes from rank0's Input at offset 20, and the second (offset 50, size 15) comes from rank1's Input at offset 70.

Result Validation

During semantic analysis, many semantic blocks are generated, each recording a data transfer relationship. After execution completes, validate whether the semantic blocks in Output memory meet expectations.

The following example uses 2-rank AllGather to illustrate the normal and abnormal scenarios for the semantic blocks in Rank0's Output memory.
Assume the input data size is 100 bytes.

Correct Scenario: [figure]

Error Scenario: [figure]

Diagnosis Approach

The semantic validation phase can detect two types of errors:

- Missing data.
- Incorrect data source.

Similar issues exist in reduction scenarios, such as ranks missing from the reduction or inconsistent offset addresses among the data being reduced. When a semantic error occurs, the analyzer normally prints hints. Use these hints together with the task sequence printed by the algorithm analyzer to analyze the specific cause.
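The two error types above can be illustrated with a minimal sketch of an AllGather output check. This is not the analyzer's actual implementation: OutputBlock and CheckAllGatherOutput are hypothetical names, OutputBlock is a simplified stand-in for a non-reduce semantic block with a single source, and the sketch assumes that AllGather places the Input data of rank r at Output offset r * inputSize.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified semantic block in one rank's Output memory.
struct OutputBlock {
    std::uint64_t dstOffset;  // Offset in Output memory
    std::uint64_t size;       // Number of bytes
    std::uint32_t srcRank;    // Rank whose Input memory the bytes come from
    std::uint64_t srcOffset;  // Offset in the source rank's Input memory
};

// Checks blocks (sorted by dstOffset) against the assumed AllGather layout.
// Returns an empty string on success, or a description of the first problem.
std::string CheckAllGatherOutput(const std::vector<OutputBlock>& blocks,
                                 std::uint32_t rankSize,
                                 std::uint64_t inputSize) {
    assert(rankSize > 0 && inputSize > 0);
    const std::uint64_t total = rankSize * inputSize;
    std::uint64_t covered = 0;  // Output bytes [0, covered) are verified so far

    for (const OutputBlock& b : blocks) {
        if (b.dstOffset > covered) {
            // A gap before this block: some Output bytes were never written.
            return "missing data at Output offset " + std::to_string(covered);
        }
        if (b.dstOffset < covered || b.size == 0 ||
            b.size > total - b.dstOffset) {
            // Overlapping, empty, or out-of-range block.
            return "malformed block at Output offset " +
                   std::to_string(b.dstOffset);
        }
        // Expected source under the assumed layout.
        const std::uint64_t expectedRank = b.dstOffset / inputSize;
        const std::uint64_t expectedSrcOffset = b.dstOffset % inputSize;
        const bool crossesSegment = expectedSrcOffset + b.size > inputSize;
        if (b.srcRank != expectedRank || b.srcOffset != expectedSrcOffset ||
            crossesSegment) {
            return "incorrect data source at Output offset " +
                   std::to_string(b.dstOffset);
        }
        covered += b.size;
    }
    if (covered != total) {
        return "missing data at Output offset " + std::to_string(covered);
    }
    return "";
}
```

For the 2-rank, 100-byte case, the blocks {0, 100, 0, 0} and {100, 100, 1, 0} pass the check. Dropping the second block reports missing data at offset 100, and changing its source rank to 0 reports an incorrect data source at offset 100.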