08 - MLOps and Engineering Practice: CI/CD for ML
CI/CD for ML
GitHub Actions pipelines; automated training, testing, and deployment

1. CI/CD for ML Overview

1.1 What Is ML CI/CD

```python
import warnings

import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

print("=" * 60)
print("CI/CD for ML: continuous integration and deployment for machine learning")
print("=" * 60)

# CI/CD flow diagram
fig, ax = plt.subplots(figsize=(14, 8))
ax.axis('off')

# Pipeline stages: (label, x, y)
stages = [
    ('Commit', 0.08, 0.7),
    ('Lint', 0.25, 0.7),
    ('Test', 0.42, 0.7),
    ('Train', 0.59, 0.7),
    ('Validate', 0.76, 0.7),
    ('Deploy', 0.93, 0.7),
]

for name, x, y in stages:
    # facecolor/edgecolor are set separately so the black outline is kept
    circle = plt.Circle((x, y), 0.07, facecolor='lightblue', edgecolor='black')
    ax.add_patch(circle)
    ax.text(x, y, name, ha='center', va='center', fontsize=7)
    # Draw an arrow to the next stage for every stage except the last
    if x < 0.86:
        ax.annotate('', xy=(x + 0.15, y), xytext=(x + 0.08, y),
                    arrowprops=dict(arrowstyle='->', lw=1))

# Tool layer: the tool typically used at each stage
tools = {
    'Git': (0.08, 0.5),
    'Linter/Formatter': (0.25, 0.5),
    'Pytest': (0.42, 0.5),
    'MLflow/DVC': (0.59, 0.5),
    'Model Registry': (0.76, 0.5),
    'K8s/Seldon': (0.93, 0.5),
}

for name, (x, y) in tools.items():
    circle = plt.Circle((x, y), 0.06, facecolor='lightgreen', edgecolor='black')
    ax.add_patch(circle)
    ax.text(x, y, name, ha='center', va='center', fontsize=6)
    ax.annotate('', xy=(x, y + 0.07), xytext=(x, y - 0.03),
                arrowprops=dict(arrowstyle='->', lw=0.5, color='gray'))

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('ML CI/CD Pipeline', fontsize=14)
plt.tight_layout()
plt.show()

print("\nCore ML CI/CD flow:")
print("  - A code commit triggers the pipeline")
print("  - Automated tests verify the code")
print("  - The model is trained and evaluated")
print("  - The model is registered and deployed")
```

2. GitHub Actions Configuration

2.1 Basic Workflow

```python
def github_actions_basic():
    """Basic GitHub Actions configuration."""
    print("\n" + "=" * 60)
    print("Basic GitHub Actions configuration")
    print("=" * 60)

    code = """
# .github/workflows/ml_pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main, develop]
    paths:
      - 'src/**'
      - 'tests/**'
      - 'models/**'
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # every day at 02:00 UTC

env:
  PYTHON_VERSION: '3.9'
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Cache pip packages
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov flake8 black mypy

      - name: Lint
        run: |
          flake8 src/
          black --check src/
          mypy src/

      - name: Run tests
        run: |
          pytest tests/ --cov=src --cov-report=xml --cov-report=html

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          flags: unittests

  train:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Train model
        run: |
          python scripts/train.py --config configs/production.yaml
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

      - name: Upload model artifact
        uses: actions/upload-artifact@v3
        with:
          name: model
          path: models/
          retention-days: 7

  evaluate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Download model
        uses: actions/download-artifact@v3
        with:
          name: model
          path: models/

      - name: Evaluate model
        run: |
          python scripts/evaluate.py --model models/model.pkl --test data/test.csv

      - name: Check threshold
        id: check
        run: |
          accuracy=$(python scripts/get_accuracy.py)
          echo "accuracy=$accuracy" >> $GITHUB_OUTPUT
          # Fail the job when accuracy is below 0.85
          if (( $(echo "$accuracy < 0.85" | bc -l) )); then
            echo "Accuracy below threshold"
            exit 1
          fi

  deploy:
    needs: evaluate
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Download model
        uses: actions/download-artifact@v3
        with:
          name: model
          path: models/

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build and push Docker image
        run: |
          docker build -t ml-api:latest .
          docker tag ml-api:latest ${{ steps.login-ecr.outputs.registry }}/ml-api:latest
          docker push ${{ steps.login-ecr.outputs.registry }}/ml-api:latest

      - name: Deploy to ECS
        run: |
          aws ecs update-service --cluster ml-cluster --service ml-service --force-new-deployment
"""
    print(code)


github_actions_basic()
```

2.2 Matrix Testing

```python
def matrix_testing():
    """Matrix testing."""
    print("\n" + "=" * 60)
    print("Matrix testing")
    print("=" * 60)

    code = """
# .github/workflows/matrix_test.yml
name: Matrix Testing

on:
  push:
    branches: [main]

jobs:
  test-matrix:
    # Run on the OS selected by the matrix rather than a fixed runner
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # Quote the versions: an unquoted 3.10 is parsed as the number 3.1
        python-version: ['3.8', '3.9', '3.10']
        os: [ubuntu-latest, macos-latest, windows-latest]
        include:
          - python-version: '3.9'
            os: ubuntu-latest
            coverage: true
      fail-fast: false
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run tests
        run: |
          pytest tests/ --junitxml=test-results-${{ matrix.python-version }}.xml

      - name: Upload test results
        uses: actions/upload-artifact@v3
        with:
          name: test-results-${{ matrix.python-version }}-${{ matrix.os }}
          path: test-results-*.xml
        if: always()

  test-models:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [random_forest, xgboost, lightgbm]
        dataset: [small, medium, large]
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Train and test model
        run: |
          python scripts/test_model.py --model ${{ matrix.model }} --dataset ${{ matrix.dataset }}
"""
    print(code)


matrix_testing()
```

3. Automated Training Pipeline

3.1 Training Workflow

```python
def training_pipeline():
    """Automated training pipeline."""
    print("\n" + "=" * 60)
    print("Automated training pipeline")
    print("=" * 60)

    # Raw string so the shell line continuations (\) are printed as written
    code = r"""
# .github/workflows/train.yml
name: Automated Training

on:
  schedule:
    - cron: '0 0 * * 0'  # every Sunday
  workflow_dispatch:
    inputs:
      model_type:
        description: 'Model type'
        required: true
        default: 'random_forest'
        type: choice
        options:
          - random_forest
          - xgboost
          - lightgbm
      n_estimators:
        description: 'Number of estimators'
        required: false
        default: '100'

jobs:
  train-weekly:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Data validation
        run: |
          python scripts/validate_data.py --source s3://bucket/data/latest

      # Inputs are empty on scheduled runs, so fall back to the defaults
      - name: Hyperparameter optimization
        run: |
          python scripts/hpo.py --model ${{ github.event.inputs.model_type || 'random_forest' }}

      - name: Train model
        run: |
          python scripts/train.py \
            --model ${{ github.event.inputs.model_type || 'random_forest' }} \
            --n_estimators ${{ github.event.inputs.n_estimators || '100' }}

      - name: Evaluate model
        run: |
          python scripts/evaluate.py --model models/model.pkl

      - name: Register model
        if: success()
        run: |
          python scripts/register_model.py --model models/model.pkl --version $(date +%Y%m%d)

      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Weekly training failed!",
              "channel": "#ml-alerts"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
"""
    print(code)


training_pipeline()
```

3.2 Hyperparameter Optimization Workflow

```python
def hpo_workflow():
    """Hyperparameter optimization workflow."""
    print("\n" + "=" * 60)
    print("Hyperparameter optimization workflow")
    print("=" * 60)

    # Raw string so the shell line continuations (\) are printed as written
    code = r"""
# .github/workflows/hpo.yml
name: Hyperparameter Optimization

on:
  workflow_dispatch:
    inputs:
      algorithm:
        description: 'Optimization algorithm'
        default: 'bayesian'
        type: choice
        options:
          - random
          - grid
          - bayesian
      n_trials:
        description: 'Number of trials'
        default: '50'
      parallel_jobs:
        description: 'Parallel jobs'
        default: '4'

jobs:
  hpo:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        trial: [1, 2, 3, 4, 5, 6, 7, 8]
      max-parallel: ${{ fromJSON(inputs.parallel_jobs) }}
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run HPO trial
        run: |
          python scripts/hpo_trial.py \
            --trial ${{ matrix.trial }} \
            --algorithm ${{ github.event.inputs.algorithm }}
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

      - name: Upload trial results
        uses: actions/upload-artifact@v3
        with:
          name: trial-${{ matrix.trial }}
          path: results/trial_${{ matrix.trial }}.json

  aggregate:
    needs: hpo
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Download all results
        uses: actions/download-artifact@v3
        with:
          path: results/

      - name: Aggregate results
        run: |
          python scripts/aggregate_hpo.py --results-dir results/

      - name: Get best parameters
        id: best
        run: |
          best_params=$(python scripts/get_best_params.py)
          echo "best_params=$best_params" >> $GITHUB_OUTPUT

      - name: Train best model
        run: |
          python scripts/train.py --params '${{ steps.best.outputs.best_params }}'

      - name: Register best model
        run: |
          python scripts/register_model.py --model models/best_model.pkl --alias best
"""
    print(code)


hpo_workflow()
```

4. Automated Deployment Pipeline

4.1 Deployment Workflow

```python
def deployment_pipeline():
    """Automated deployment pipeline."""
    print("\n" + "=" * 60)
    print("Automated deployment pipeline")
    print("=" * 60)

    code = """
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        default: 'staging'
        type: choice
        options:
          - staging
          - production
      model_version:
        description: 'Model version'
        required: true
      canary_percent:
        description: 'Canary traffic percentage'
        default: '10'

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    if: github.event.inputs.environment == 'staging'
    environment: staging
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Download model from registry
        run: |
          python scripts/download_model.py --version ${{ github.event.inputs.model_version }}

      - name: Build Docker image
        run: |
          docker build -t ml-api:staging .
          docker tag ml-api:staging ${{ secrets.ECR_REGISTRY }}/ml-api:staging-${{ github.sha }}

      - name: Push to ECR
        run: |
          docker push ${{ secrets.ECR_REGISTRY }}/ml-api:staging-${{ github.sha }}

      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/ml-api-staging ml-api=${{ secrets.ECR_REGISTRY }}/ml-api:staging-${{ github.sha }}
          kubectl rollout status deployment/ml-api-staging

  deploy-production:
    runs-on: ubuntu-latest
    if: github.event.inputs.environment == 'production'
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Download model
        run: |
          python scripts/download_model.py --version ${{ github.event.inputs.model_version }}

      - name: Build Docker image
        run: |
          docker build -t ml-api:production .
          docker tag ml-api:production ${{ secrets.ECR_REGISTRY }}/ml-api:production-${{ github.sha }}
          # The cluster can only pull the image once it is in the registry
          docker push ${{ secrets.ECR_REGISTRY }}/ml-api:production-${{ github.sha }}

      - name: Canary deployment
        run: |
          # Deploy the canary version
          kubectl set image deployment/ml-api-canary ml-api=${{ secrets.ECR_REGISTRY }}/ml-api:production-${{ github.sha }}
          # Route traffic to the canary. Note: this switches the Service
          # selector entirely; canary_percent is not applied by this command.
          kubectl patch service/ml-api -p '{"spec":{"selector":{"version":"canary"}}}'

      - name: Run smoke tests
        run: |
          python tests/smoke_test.py --endpoint http://ml-api-canary

      - name: Full deployment
        if: success()
        run: |
          kubectl set image deployment/ml-api ml-api=${{ secrets.ECR_REGISTRY }}/ml-api:production-${{ github.sha }}
          kubectl rollout status deployment/ml-api

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl rollout undo deployment/ml-api
"""
    print(code)


deployment_pipeline()
```

4.2 Blue-Green Deployment

```python
def blue_green_deployment():
    """Blue-green deployment."""
    print("\n" + "=" * 60)
    print("Blue-green deployment")
    print("=" * 60)

    code = """
# .github/workflows/blue_green.yml
name: Blue-Green Deployment

on:
  workflow_dispatch:
    inputs:
      model_version:
        description: 'Model version'
        required: true

jobs:
  blue-green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3

      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 --decode > kubeconfig
          # A plain `export` does not survive into later steps; write to GITHUB_ENV instead
          echo "KUBECONFIG=$PWD/kubeconfig" >> $GITHUB_ENV

      - name: Deploy green version
        run: |
          # Deploy the new version to the green environment
          kubectl apply -f deployment-green.yaml
          kubectl rollout status deployment/ml-api-green

      - name: Run integration tests on green
        run: |
          python tests/integration_test.py --endpoint ml-api-green

      - name: Switch traffic to green
        run: |
          # Switch traffic
          kubectl patch service ml-api -p '{"spec":{"selector":{"version":"green"}}}'

      - name: Validate production traffic
        run: |
          python tests/smoke_test.py --endpoint ml-api --duration 300

      - name: Clean up blue version
        if: success()
        run: |
          kubectl delete deployment ml-api-blue
"""
    print(code)


blue_green_deployment()
```

5. Testing Strategy

5.1 Testing Workflow

```python
def testing_strategy():
    """Testing strategy."""
    print("\n" + "=" * 60)
    print("Testing strategy")
    print("=" * 60)

    code = """
# .github/workflows/testing.yml
name: Comprehensive Testing

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run unit tests
        run: |
          pytest tests/unit/ --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Start services
        run: |
          docker-compose -f docker-compose.test.yml up -d
          sleep 10

      - name: Run integration tests
        run: |
          pytest tests/integration/ -v

      - name: Stop services
        if: always()
        run: docker-compose -f docker-compose.test.yml down

  model-validation:
    runs-on: ubuntu-latest
    needs: integration-tests
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Train small model
        run: |
          python scripts/train.py --config configs/test.yaml

      - name: Validate model
        run: |
          python scripts/validate_model.py --model models/model.pkl

      - name: Check model fairness
        run: |
          python scripts/check_fairness.py --model models/model.pkl

  performance-tests:
    runs-on: ubuntu-latest
    needs: model-validation
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Run performance tests
        run: |
          pytest tests/performance/ --benchmark-only

      - name: Compare with baseline
        run: |
          python scripts/compare_performance.py --baseline baseline.json --current current.json
"""
    print(code)


testing_strategy()
```

6. Monitoring and Notifications

6.1 Monitoring Workflow

```python
def monitoring_workflow():
    """Monitoring workflow."""
    print("\n" + "=" * 60)
    print("Monitoring and notifications")
    print("=" * 60)

    # Raw string so the shell continuations (\) and the JSON \n escape are printed as written
    code = r"""
# .github/workflows/monitoring.yml
name: Model Monitoring

on:
  schedule:
    - cron: '*/30 * * * *'  # every 30 minutes
  workflow_dispatch:

jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Check model performance
        run: |
          python scripts/monitor_model.py --endpoint https://api.example.com/predict

      - name: Detect data drift
        run: |
          python scripts/detect_drift.py --reference data/reference.csv --current data/current.csv

      - name: Send metrics to Datadog
        run: |
          curl -X POST "https://api.datadoghq.com/api/v1/series" \
            -H "Content-Type: application/json" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -d @metrics.json

  alert:
    runs-on: ubuntu-latest
    needs: monitor
    if: failure()
    steps:
      - name: Send Slack alert
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Model monitoring alert! Performance degraded.",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Model Performance Alert*\nModel accuracy has dropped below threshold."
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Create Jira ticket
        uses: atlassian/gajira-create@v3
        with:
          project: ML
          issuetype: Bug
          summary: Model performance degradation detected
          description: Model accuracy dropped below threshold. Investigation required.
        env:
          JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
          JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
          JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}
"""
    print(code)


monitoring_workflow()
```

7. Summary

| Stage | Tools | Purpose |
| --- | --- | --- |
| Linting | flake8, black, mypy | Code quality |
| Unit tests | pytest | Functional verification |
| Integration tests | pytest, docker-compose | System integration |
| Model training | MLflow, DVC | Model production |
| Model validation | pytest, custom scripts | Performance validation |
| Deployment | kubectl, helm | Service deployment |
| Monitoring | Prometheus, Datadog | Continuous monitoring |

Best practices

- Use matrix testing to cover multiple environments
- Use blue-green deployment to reduce downtime
- Automate model validation and rollback
- Integrate monitoring and alerting
- Record all deployment history
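The "Check threshold" step in section 2.1 compares the accuracy against 0.85 in shell with `bc`. The same gate can be written in Python, which avoids shell quoting problems and is easier to unit test. The sketch below is only an illustration under stated assumptions: the file name `metrics.json`, its `accuracy` key, and the helper names `check_accuracy`, `write_github_output`, and `main` are hypothetical and do not come from the article's scripts.

```python
import json
import os
import sys
from pathlib import Path


def check_accuracy(metrics_path, threshold=0.85):
    """Read the accuracy from a JSON metrics file and compare it to the threshold.

    Returns a tuple (accuracy, passed). The file layout {"accuracy": <float>}
    is an assumption made for this sketch.
    """
    metrics = json.loads(Path(metrics_path).read_text(encoding="utf-8"))
    accuracy = float(metrics["accuracy"])
    return accuracy, accuracy >= threshold


def write_github_output(name, value):
    """Append name=value to the file named by $GITHUB_OUTPUT, if that variable is set."""
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a", encoding="utf-8") as fh:
            fh.write(f"{name}={value}\n")


def main(argv):
    """Return 0 when the model passes the gate and 1 otherwise."""
    metrics_path = argv[1] if len(argv) > 1 else "metrics.json"
    accuracy, passed = check_accuracy(metrics_path)
    write_github_output("accuracy", accuracy)
    if not passed:
        print(f"Accuracy {accuracy:.4f} is below threshold", file=sys.stderr)
        return 1
    print(f"Accuracy {accuracy:.4f} meets threshold")
    return 0
```

To use it as a workflow step, a script would call `sys.exit(main(sys.argv))` under an `if __name__ == "__main__":` guard, so that a non-zero exit code fails the job and blocks the downstream deploy job.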