GE Language-Agnostic Custom Operator Integration RFC

# [RFC] Language-Agnostic Custom Operator Integration into GE

GE (Graph Engine) is a graph compiler and executor for Ascend. It accelerates model execution and reduces model memory usage through techniques such as computation graph optimization, multi-stream parallelism, memory reuse, and model sinking. GE provides friendly integration for the PyTorch and TensorFlow frontends and supports parsing and compiling mainstream model formats such as onnx and pb. Project address: https://gitcode.com/cann/ge

## Summary

This document proposes a **language-agnostic** mechanism for integrating custom operators into GE. A unified operator integration interface decouples the integration process from any specific operator programming language (Ascend C, Triton, PPTO, etc.) and gives developers a **progressive** experience: an operator can start with runtime execution only, then take part in compile-time optimization, gaining more performance at each step.

## Motivation

Current GE support for custom operator integration has two key pain points:

1. Only custom operators developed in Ascend C are supported. As operator programming languages diversify (Triton, for example, is highly appealing for its usability), users want operators written in other languages to integrate into GE as well.
2. Graph integration deliverables are numerous and scattered, and usability needs improvement. Developers must maintain the proto definition, execution logic, compilation logic, and other files at the same time, with no unified way to organize deliverables.

## Proposed Design

### Architecture View

A unified development interface connects the different programming languages.
Custom operators are loaded into GE as `.so` deliverables and participate in the full flow of graph compilation and execution.

### Progressive Capability Model

Custom operator graph integration is divided into 3 stages. Development effort and performance benefits increase from one stage to the next:

| Stage | Core Capability | New Deliverable | Performance Benefit |
| --- | --- | --- | --- |
| Stage 1 | Execute (host schedules kernel) | 1 `.so` | Runnable, with host scheduling overhead |
| Stage 2.1 | Execute (sink scheduling) | No new | Eliminates host scheduling overhead under static shape |
| Stage 2.2 | + InferShape + Compile | No new | Shape inference, memory reuse, online compilation |
| Stage 3 | + Serialize / Deserialize | No new | Offline OM deployment |

### Pseudocode Development Example

The examples below use an Add operator to show the developer experience at each stage.

#### Stage 1: Dynamic Shape Host Scheduling

Only `Execute` needs to be implemented. It loads the kernel and launches it:

```cpp
class AddCustom : public EagerExecuteOp {
 public:
  graphStatus Execute(gert::EagerOpExecutionContext *ctx) override {
    // 1. Get inputs
    auto *x = ctx->GetInputTensor(0);
    auto *y = ctx->GetInputTensor(1);

    // 2. Allocate output
    auto *z = ctx->MallocOutputTensor(0, x->GetShape(), x->GetFormat(), x->GetDataType());

    // 3. Load kernel binary (pre-compiled npubin / Ascend C binary)
    auto bin_data = LoadBinary("add_kernel.npubin");
    auto func_handle = GetKernelFunction(bin_data, "add_kernel");

    // 4. Construct args and launch
    int64_t n = x->GetShapeSize();
    int32_t block_num = CeilDiv(n, BLOCK_SIZE);
    struct Args {
      const void *in0, *in1;
      void *out;
      int32_t n, gx, gy, gz;
    } args = {x->GetAddr(), y->GetAddr(), z->GetAddr(), static_cast<int32_t>(n), block_num, 1, 1};
    aclrtLaunchKernelWithHostArgs(func_handle, block_num, ctx->GetStream(), nullptr,
                                  &args, sizeof(args), nullptr, 0);
    return GRAPH_SUCCESS;
  }
};

REG_OP(AddCustom)
    .INPUT(x, TensorType({DT_FLOAT, DT_FLOAT16}))
    .INPUT(y, TensorType({DT_FLOAT, DT_FLOAT16}))
    .OUTPUT(z, TensorType({DT_FLOAT, DT_FLOAT16}))
    .OP_END_FACTORY_REG(AddCustom);
REG_AUTO_MAPPING_OP(AddCustom);
```

Effect: the operator can run in a GE graph and supports dynamic shape, but each inference step incurs host-side scheduling overhead.

#### Stage 2: Static Shape Sink

Add `ShapeInferOp` and `CompilableOp` on top of stage 1:

```cpp
class AddCustom : public EagerExecuteOp, public ShapeInferOp, public CompilableOp {
 public:
  // Execute same as stage 1, omitted...

  graphStatus InferShape(gert::InferShapeContext *ctx) override {
    // Output shape equals the shape of input 0
    *ctx->GetOutputShape(0) = *ctx->GetInputShape(0);
    return GRAPH_SUCCESS;
  }

  graphStatus InferDataType(gert::InferDataTypeContext *ctx) override {
    return ctx->SetOutputDataType(0, ctx->GetInputDataType(0));
  }

  graphStatus Compile(gert::OpCompileContext *ctx) override {
    auto *input = ctx->GetInputTensor(0);
    auto key = BuildKey(input->GetShape());  // one compiled binary per input shape

    auto source = LoadFile("add_kernel.cpp");
    aclrtcProg prog;
    aclrtcCreateProg(&prog, source.c_str(), "add_kernel", 0, nullptr, nullptr);
    aclrtcCompileProg(prog, 1, options);  // options: compile options, omitted here

    size_t bin_size;
    aclrtcGetBinDataSize(prog, &bin_size);
    device_elves_[key].resize(bin_size);
    aclrtcGetBinData(prog, device_elves_[key].data());
    aclrtcDestroyProg(&prog);
    return GRAPH_SUCCESS;
  }

 private:
  // Shape key -> compiled device binary
  std::map<std::string, std::vector<uint8_t>> device_elves_;
};
```

Effect:

- Stage 2.1 (no new deliverables): static-shape kernel sink scheduling eliminates host overhead.
- Stage 2.2: the operator participates in shape inference and memory reuse, and the Compile stage performs online compilation of the operator.

#### Stage 3: Offline OM Support

Add `PortableOp` on top of stage 2:

```cpp
class AddCustom : public EagerExecuteOp, public ShapeInferOp, public CompilableOp, public PortableOp {
 public:
  // Execute / InferShape / Compile same as stage 2, omitted...

  graphStatus Serialize(std::vector<uint8_t> &buffer) override {
    // Serialize device_elves_ to buffer (format is user-defined, GE only passes it through)
    return SerializeBinaryMap(device_elves_, buffer);
  }

  graphStatus Deserialize(const std::vector<uint8_t> &buffer) override {
    // Restore device_elves_ from buffer
    return DeserializeBinaryMap(buffer, device_elves_);
  }
};
```

Effect: compilation artifacts are saved and restored with the OM file, supporting the AIR → ATC → OM → ACL offline deployment chain.

### Language Common Layer Encapsulation Effect

The infrastructure-layer code above is about 60-80 lines. Each programming language can build a common layer on top of it for further encapsulation. Using Triton as an example:

```cpp
// With the Triton common layer, the same Add operator needs only ~10 lines
TRITON_CUSTOM_OP(AddCustom)
    .Kernel("add_kernel")                         // Declare kernel name
    .Binary("add_kernel.npubin")                  // Declare binary path
    .Inputs({"x", "y"})                           // Declare inputs
    .Outputs({"z"})                               // Declare outputs
    .InferShapeSameAsInput(0)                     // Output shape = input 0 shape
    .InferDataTypeSameAsInput(0)                  // Output dtype = input 0 dtype
    .TilingStrategy(TilingStrategy::ElementWise)  // Auto calculate block_num
    .Build();
```

Before and after encapsulation:

| Repeated Logic | Infrastructure Layer (manual) | Language Common Layer (automatic) |
| --- | --- | --- |
| Binary loading | Manual `aclrtBinaryLoadFromData` | Declare the `.Binary()` path |
| Args construction | Manually assemble a packed struct | Auto-generated from the kernel signature |
| block_num calculation | Manual `CeilDiv(n, BLOCK_SIZE)` | `.TilingStrategy(ElementWise)` |
| REG_OP definition | Manual proto writing | Auto-generated from `.Inputs()` / `.Outputs()` |
| InferShape | Manual implementation | `.InferShapeSameAsInput(0)` |

### Infrastructure Positioning and Language Common Layer

| Layer | Responsibility | Maintainer |
| --- | --- | --- |
| GE Infrastructure Layer | Unified integration interface, registration mechanism, compile/execute callbacks, serialization protocol | GE team |
| Language Common Layer | Encapsulates language-specific boilerplate (binary loading, args construction, etc.) | Each language SDK team |
| Operator Developers | Only need to implement kernel logic + a few declarations | Operator developers |

### Frontend Integration

| Frontend | Additional Deliverables | Integration Method |
| --- | --- | --- |
| GE native | None | REG_OP + OperatorFactory |
| PyTorch (TorchAir) | TORCH_LIBRARY + converter | FX node mapping to GE op type |
| TensorFlow | libcustom_ops.so + npu_supported_ops.json | TF Adapter graph construction conversion; REG_AUTO_MAPPING_OP auto-generates the GE proto |
| ONNX | REGISTER_CUSTOM_OP parsing plugin | NodeProto attribute mapping to GE Operator |

## Open Questions (Issues to Discuss)

1. Language Common Layer Standardization Level: Should the common layers for each language be unified templates/SDKs provided by GE, or be maintained independently by each language team?
2. Multi-version Compatibility: When the GE infrastructure layer interface evolves, how do we ensure that `.so` deliverables built against an old version still load on a new GE? Is an operator version field needed?
3. Compile-time Parallel Safety: `CustomGraphOptimizer` calls back `Compile` in parallel, which currently requires operator implementations to ensure thread safety themselves. Should the framework layer provide a locking mechanism?
4. Serialization Format Standardization: The `PortableOp` buffer format is currently entirely user-defined. Should GE provide standard serialization helper tools?
5. ONNX Custom Domain Support: The ONNX parsing plugin currently needs explicit registration for each `domain::version::OpType`. Should wildcards or an auto-discovery mechanism be supported?

## Timeline

| Stage | Status | Description |
| --- | --- | --- |
| Stage 1 (dynamic shape host scheduling) | ✅ Completed | See `examples/custom_op/triton_add_custom` |
| Stage 2.1 (static shape sink only) | ✅ Completed | Same sample verifies the sink effect |
| Stage 2.2 (sink full benefits) | ✅ Completed | Shape inference, memory reuse, online compilation supported |
| Stage 3 (offline OM support) | ✅ Completed | See `examples/custom_op/compilable_add_custom` |
| Language Common Layer | Planned | Each language SDK team builds as needed |

## References

- Development Guide: `custom_op_development_guide.md`
- Architecture Design: `custom_op_architecture.md`
- Sample Code: `examples/custom_op/`

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
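
As a supplement to the offline OM (`PortableOp`) example, the following is a minimal sketch of what helpers like `SerializeBinaryMap` and `DeserializeBinaryMap` could look like. The RFC states that the buffer format is user-defined, so the length-prefixed little-endian layout here is purely an assumption for illustration, not a format defined by GE. The helper names `AppendU64`, `ReadU64`, and `ReadBytes` and the `BinaryMap` alias are hypothetical, and the functions return `bool` instead of `graphStatus` so that the sketch compiles without GE headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Shape key -> compiled device binary, mirroring device_elves_ in the RFC example.
using BinaryMap = std::map<std::string, std::vector<uint8_t>>;

// Append a 64-bit value in little-endian byte order so the layout is host-independent.
inline void AppendU64(std::vector<uint8_t> &buffer, uint64_t value) {
  for (int i = 0; i < 8; ++i) {
    buffer.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFFu));
  }
}

// Read a little-endian 64-bit value at `offset`; fails if fewer than 8 bytes remain.
inline bool ReadU64(const std::vector<uint8_t> &buffer, size_t &offset, uint64_t &value) {
  if (buffer.size() - offset < 8) return false;
  value = 0;
  for (int i = 0; i < 8; ++i) {
    value |= static_cast<uint64_t>(buffer[offset + i]) << (8 * i);
  }
  offset += 8;
  return true;
}

// Copy `len` raw bytes starting at `offset` into `out`; fails if the buffer is too short.
template <typename Container>
bool ReadBytes(const std::vector<uint8_t> &buffer, size_t &offset, uint64_t len, Container &out) {
  if (len > buffer.size() - offset) return false;
  const auto first = buffer.begin() + static_cast<std::ptrdiff_t>(offset);
  out.assign(first, first + static_cast<std::ptrdiff_t>(len));
  offset += static_cast<size_t>(len);
  return true;
}

// Assumed layout: [entry count] then, per entry, [key length][key][blob length][blob].
bool SerializeBinaryMap(const BinaryMap &binaries, std::vector<uint8_t> &buffer) {
  buffer.clear();
  AppendU64(buffer, binaries.size());
  for (const auto &entry : binaries) {
    AppendU64(buffer, entry.first.size());
    buffer.insert(buffer.end(), entry.first.begin(), entry.first.end());
    AppendU64(buffer, entry.second.size());
    buffer.insert(buffer.end(), entry.second.begin(), entry.second.end());
  }
  return true;
}

// Parse into a temporary map so `binaries` is left untouched when the buffer is malformed.
bool DeserializeBinaryMap(const std::vector<uint8_t> &buffer, BinaryMap &binaries) {
  size_t offset = 0;
  uint64_t count = 0;
  if (!ReadU64(buffer, offset, count)) return false;
  BinaryMap parsed;
  for (uint64_t i = 0; i < count; ++i) {
    uint64_t len = 0;
    std::string key;
    std::vector<uint8_t> blob;
    if (!ReadU64(buffer, offset, len) || !ReadBytes(buffer, offset, len, key)) return false;
    if (!ReadU64(buffer, offset, len) || !ReadBytes(buffer, offset, len, blob)) return false;
    parsed[std::move(key)] = std::move(blob);
  }
  if (offset != buffer.size()) return false;  // reject trailing bytes
  binaries = std::move(parsed);
  return true;
}
```

Length prefixes with explicit bounds checks let a truncated or corrupted buffer fail cleanly rather than read out of range. A real format would probably also carry a magic number and a version field, which ties into the multi-version compatibility question raised above.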
