# [RFC] Language-Agnostic Custom Operator Integration into GE

GE (Graph Engine) is a graph compiler and executor for Ascend. It provides computation graph optimization, multi-stream parallelism, memory reuse, and model sinking to speed up model execution and reduce model memory usage. GE offers friendly integration with the PyTorch and TensorFlow frontends and supports parsing and compiling mainstream model formats such as onnx and pb. Project address: https://gitcode.com/cann/ge

## Summary

This document proposes a **language-agnostic** mechanism for integrating custom operators into GE. By defining a unified operator integration interface, it decouples custom operator integration from any specific operator programming language (Ascend C, Triton, PPTO, etc.) and offers a **progressive** development experience: an operator starts with runtime execution only, then opts in to compile-time optimization, gaining more performance at each step.

## Motivation

GE's current support for custom operator integration has two key pain points:

- Only custom operators developed in Ascend C are supported. As operator programming languages diversify (Triton, for example, is attractive for its usability), users want operators written in other languages to integrate into GE as well.
- Graph integration deliverables are numerous and scattered, which hurts usability. Developers must maintain the proto definition, execution logic, compilation logic, and other files at the same time, with no unified way to organize the deliverables.

## Proposed Design

### Architecture View

A unified development interface connects the different programming languages.

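To make the idea of a unified interface concrete, here is a minimal sketch. It assumes a hypothetical `CustomOp` base class and a hypothetical `OpRegistry`; neither name comes from GE, and GE's real registration path may work differently. The sketch only illustrates that the engine can look an operator up by type without knowing which language produced its kernel.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical unified interface; a stand-in for illustration, not GE's real API.
struct CustomOp {
  virtual ~CustomOp() = default;
  // Returns a description of the work; a real operator would launch a kernel here.
  virtual std::string Execute() const = 0;
};

// Hypothetical registry mapping an op type to a factory, independent of kernel language.
class OpRegistry {
 public:
  using Factory = std::function<std::unique_ptr<CustomOp>()>;

  static OpRegistry &Instance() {
    static OpRegistry registry;
    return registry;
  }

  // Returns false if the op type is already registered.
  bool Register(const std::string &type, Factory factory) {
    return factories_.emplace(type, std::move(factory)).second;
  }

  // Returns nullptr for an unknown op type.
  std::unique_ptr<CustomOp> Create(const std::string &type) const {
    auto it = factories_.find(type);
    if (it == factories_.end()) return nullptr;
    return it->second();
  }

 private:
  std::map<std::string, Factory> factories_;
};

// Two operators whose kernels come from different languages share one interface.
struct TritonAdd : CustomOp {
  std::string Execute() const override { return "add via Triton kernel"; }
};
struct AscendCAdd : CustomOp {
  std::string Execute() const override { return "add via Ascend C kernel"; }
};

// Registers both operators; a real .so would typically do this from static
// initializers when the library is loaded.
inline void RegisterDemoOps() {
  OpRegistry::Instance().Register(
      "TritonAdd", [] { return std::unique_ptr<CustomOp>(new TritonAdd()); });
  OpRegistry::Instance().Register(
      "AscendCAdd", [] { return std::unique_ptr<CustomOp>(new AscendCAdd()); });
}
```

After `RegisterDemoOps()`, a caller obtains either operator with `OpRegistry::Instance().Create("TritonAdd")` and invokes `Execute()` through the base pointer, so the calling code is identical for both languages.
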
Custom operators are loaded into GE as `.so` deliverables and participate in the full flow of graph compilation and execution.

### Progressive Capability Model

Custom operator graph integration is divided into three stages; development effort and performance benefits increase progressively:

| Stage | Core Capability | New Deliverable | Performance Benefit |
| --- | --- | --- | --- |
| Stage 1 | `Execute` (host schedules kernel) | 1 `.so` | Runnable, with host scheduling overhead |
| Stage 2.1 | `Execute` (sink scheduling) | No new | Eliminates host scheduling overhead under static shape |
| Stage 2.2 | + `InferShape` + `Compile` | No new | Shape inference, memory reuse, online compilation |
| Stage 3 | + `Serialize` / `Deserialize` | No new | Offline OM deployment |

### Pseudocode Development Example

The Add operator is used below as an example to show the developer experience at each stage.

#### Stage 1: Dynamic Shape Host Scheduling

Only `Execute` needs to be implemented; it loads and launches the kernel:

```cpp
class AddCustom : public EagerExecuteOp {
 public:
  graphStatus Execute(gert::EagerOpExecutionContext *ctx) override {
    // 1. Get inputs
    auto *x = ctx->GetInputTensor(0);
    auto *y = ctx->GetInputTensor(1);

    // 2. Allocate output
    auto *z = ctx->MallocOutputTensor(0, x->GetShape(), x->GetFormat(), x->GetDataType());

    // 3. Load kernel binary (pre-compiled npubin / Ascend C binary)
    auto bin_data = LoadBinary("add_kernel.npubin");
    auto func_handle = GetKernelFunction(bin_data, "add_kernel");

    // 4. Construct args and launch
    int64_t n = x->GetShapeSize();
    int32_t block_num = CeilDiv(n, BLOCK_SIZE);
    struct Args {
      const void *in0, *in1;
      void *out;
      int32_t n, gx, gy, gz;
    } args{x->GetAddr(), y->GetAddr(), z->GetAddr(), static_cast<int32_t>(n), block_num, 1, 1};
    aclrtLaunchKernelWithHostArgs(func_handle, block_num, ctx->GetStream(), nullptr,
                                  &args, sizeof(args), nullptr, 0);
    return GRAPH_SUCCESS;
  }
};

REG_OP(AddCustom)
    .INPUT(x, TensorType({DT_FLOAT, DT_FLOAT16}))
    .INPUT(y, TensorType({DT_FLOAT, DT_FLOAT16}))
    .OUTPUT(z, TensorType({DT_FLOAT, DT_FLOAT16}))
    .OP_END_FACTORY_REG(AddCustom);

REG_AUTO_MAPPING_OP(AddCustom);
```

Effect: The operator can run in a GE graph and supports dynamic shape, but each inference step incurs host-side scheduling overhead.

#### Stage 2: Static Shape Sink

Add `ShapeInferOp` and `CompilableOp` on top of stage 1:

```cpp
class AddCustom : public EagerExecuteOp, public ShapeInferOp, public CompilableOp {
 public:
  // Execute same as stage 1, omitted...

  graphStatus InferShape(gert::InferShapeContext *ctx) override {
    // Output shape equals the shape of input 0
    *ctx->GetOutputShape(0) = *ctx->GetInputShape(0);
    return GRAPH_SUCCESS;
  }

  graphStatus InferDataType(gert::InferDataTypeContext *ctx) override {
    return ctx->SetOutputDataType(0, ctx->GetInputDataType(0));
  }

  graphStatus Compile(gert::OpCompileContext *ctx) override {
    // Compile once per input shape and cache the device binary under that key
    auto *input = ctx->GetInputTensor(0);
    auto key = BuildKey(input->GetShape());
    auto source = LoadFile("add_kernel.cpp");

    aclrtcProg prog;
    aclrtcCreateProg(&prog, source.c_str(), "add_kernel", 0, nullptr, nullptr);
    aclrtcCompileProg(prog, 1, options);  // options: compile flags, definition omitted

    size_t bin_size;
    aclrtcGetBinDataSize(prog, &bin_size);
    device_elves_[key].resize(bin_size);
    aclrtcGetBinData(prog, device_elves_[key].data());
    aclrtcDestroyProg(&prog);
    return GRAPH_SUCCESS;
  }

 private:
  std::map<std::string, std::vector<uint8_t>> device_elves_;
};
```

Effect:

- Stage 2.1 (no new deliverables): static shape kernel sink scheduling, which eliminates host overhead.
- Stage 2.2: the operator participates in shape derivation and memory reuse, and the `Compile` phase performs online compilation of the operator.

#### Stage 3: Offline OM Support

Add `PortableOp` on top of stage 2:

```cpp
class AddCustom : public EagerExecuteOp, public ShapeInferOp, public CompilableOp, public PortableOp {
 public:
  // Execute / InferShape / Compile same as stage 2, omitted...

  graphStatus Serialize(std::vector<uint8_t> &buffer) override {
    // Serialize device_elves_ to buffer (format user-defined, GE only passes it through)
    return SerializeBinaryMap(device_elves_, buffer);
  }

  graphStatus Deserialize(const std::vector<uint8_t> &buffer) override {
    // Restore device_elves_ from buffer
    return DeserializeBinaryMap(buffer, device_elves_);
  }
};
```

Effect: Compilation artifacts are saved and restored with the OM file, which supports the `AIR → ATC → OM → ACL` offline deployment chain.

### Language Common Layer Encapsulation Effect

The infrastructure layer code above is about 60-80 lines. Each programming language can build a common layer on top of it for further encapsulation. Using Triton as an example:

```cpp
// With the Triton common layer, the same Add operator needs only ~10 lines
TRITON_CUSTOM_OP(AddCustom)
    .Kernel("add_kernel")                         // Declare kernel name
    .Binary("add_kernel.npubin")                  // Declare binary path
    .Inputs({"x", "y"})                           // Declare inputs
    .Outputs({"z"})                               // Declare outputs
    .InferShapeSameAsInput(0)                     // Output shape = input 0 shape
    .InferDataTypeSameAsInput(0)                  // Output dtype = input 0 dtype
    .TilingStrategy(TilingStrategy::ElementWise)  // Auto-calculate block_num
    .Build();
```

Before and after encapsulation:

| Repeated Logic | Infrastructure Layer (manual) | Language Common Layer (automatic) |
| --- | --- | --- |
| Binary loading | Manual `aclrtBinaryLoadFromData` | Declare `.Binary()` path |
| Args construction | Manually assemble packed struct | Auto-generated from kernel signature |
| `block_num` calculation | Manual `CeilDiv(n, BLOCK_SIZE)` | `.TilingStrategy(ElementWise)` |
| `REG_OP` definition | Manual proto writing | `.Inputs()` / `.Outputs()` auto-generate it |
| `InferShape` | Manual implementation | `.InferShapeSameAsInput(0)` |

### Infrastructure Positioning and Language Common Layer

| Layer | Responsibility | Maintainer |
| --- | --- | --- |
| GE Infrastructure Layer | Unified integration interface, registration mechanism, compile/execute callbacks, serialization protocol | GE team |
| Language Common Layer | Encapsulate language-specific boilerplate (binary loading, args construction, etc.) | Each language SDK team |
| Operator Developers | Only need to implement kernel logic + a few declarations | Operator developers |

### Frontend Integration

| Frontend | Additional Deliverables | Integration Method |
| --- | --- | --- |
| GE native | None | `REG_OP` + `OperatorFactory` |
| PyTorch + TorchAir | `TORCH_LIBRARY` + converter | FX node mapping to GE op type |
| TensorFlow | `libcustom_ops.so` + `npu_supported_ops.json` | TF Adapter graph construction conversion; `REG_AUTO_MAPPING_OP` auto-generates the GE proto |
| ONNX | `REGISTER_CUSTOM_OP` + parsing plugin | NodeProto attribute mapping to GE Operator |

## Open Questions (Issues to Discuss)

- **Language Common Layer Standardization Level**: Should the common layer for each language be a unified template/SDK provided by GE, or be maintained independently by each language team?
- **Multi-version Compatibility**: When the GE infrastructure layer interface evolves, how can `.so` deliverables built against an old version still load on a new GE? Is an operator version field needed?
- **Compile-time Parallel Safety**: `CustomGraphOptimizer` invokes `Compile` callbacks in parallel, which currently requires operator implementations to ensure thread safety themselves. Should the framework layer provide a locking mechanism?
- **Serialization Format Standardization**: The `PortableOp` buffer format is currently completely user-defined. Should GE provide standard serialization helper tools?
- **ONNX Custom Domain Support**: The ONNX parsing plugin currently needs explicit registration for each `domain::version::OpType`. Should wildcards or an auto-discovery mechanism be supported?

## Timeline

| Stage | Status | Description |
| --- | --- | --- |
| Stage 1 (dynamic shape host scheduling) | ✅ Completed | See `examples/custom_op/triton_add_custom` |
| Stage 2.1 (static shape sink only) | ✅ Completed | Same sample verifies sink effect |
| Stage 2.2 (sink full benefits) | ✅ Completed | Shape derivation, memory reuse, online compilation supported |
| Stage 3 (offline OM support) | ✅ Completed | See `examples/custom_op/compilable_add_custom` |
| Language Common Layer | Planned | Each language SDK team builds as needed |

## References

- Development Guide: `custom_op_development_guide.md`
- Architecture Design: `custom_op_architecture.md`
- Sample Code: `examples/custom_op/`

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
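
On the Serialization Format Standardization question, one possible shape for a standard helper is a length-prefixed encoding of the map from shape key to device binary. The sketch below is an assumption, not an existing GE utility: it reuses the names `SerializeBinaryMap` and `DeserializeBinaryMap` from the `PortableOp` pseudocode, but returns `bool` rather than `graphStatus` so that it stays self-contained, and the byte layout (little-endian 64-bit lengths) is chosen purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using BinaryMap = std::map<std::string, std::vector<uint8_t>>;

// Appends a 64-bit value in little-endian byte order.
inline void PutU64(std::vector<uint8_t> &buf, uint64_t v) {
  for (int i = 0; i < 8; ++i) buf.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Reads a little-endian 64-bit value at pos; fails if fewer than 8 bytes remain.
inline bool GetU64(const std::vector<uint8_t> &buf, size_t &pos, uint64_t &v) {
  if (buf.size() - pos < 8) return false;  // pos never exceeds buf.size()
  v = 0;
  for (int i = 0; i < 8; ++i) v |= static_cast<uint64_t>(buf[pos + i]) << (8 * i);
  pos += 8;
  return true;
}

// Layout: [entry count][key length][key bytes][value length][value bytes]...
inline bool SerializeBinaryMap(const BinaryMap &m, std::vector<uint8_t> &buf) {
  buf.clear();
  PutU64(buf, m.size());
  for (const auto &kv : m) {
    PutU64(buf, kv.first.size());
    buf.insert(buf.end(), kv.first.begin(), kv.first.end());
    PutU64(buf, kv.second.size());
    buf.insert(buf.end(), kv.second.begin(), kv.second.end());
  }
  return true;
}

// Rebuilds the map; returns false on truncated input or trailing bytes.
inline bool DeserializeBinaryMap(const std::vector<uint8_t> &buf, BinaryMap &m) {
  m.clear();
  size_t pos = 0;
  uint64_t count = 0;
  if (!GetU64(buf, pos, count)) return false;
  for (uint64_t i = 0; i < count; ++i) {
    uint64_t klen = 0;
    if (!GetU64(buf, pos, klen) || buf.size() - pos < klen) return false;
    std::string key(buf.begin() + pos, buf.begin() + pos + klen);
    pos += klen;
    uint64_t vlen = 0;
    if (!GetU64(buf, pos, vlen) || buf.size() - pos < vlen) return false;
    m[key].assign(buf.begin() + pos, buf.begin() + pos + vlen);
    pos += vlen;
  }
  return pos == buf.size();
}
```

Every length is validated against the remaining buffer before it is used, so a truncated or corrupted buffer is rejected instead of being read out of bounds. A shared helper would probably also need a format version field, which this sketch leaves out.
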