Reliability Engineering Specialist

 

Description:

As a Senior Site Reliability Engineer in Supply Chain Management (SCM) – Make & Deliver, you will ensure that SAP Digital Manufacturing and SAP Logistics Management operate reliably and efficiently at scale. These solutions, built on SAP BTP, Kubernetes, and multicloud environments, support critical manufacturing and logistics processes worldwide. In this role, you act as an Enablement Advocate within the organization: partnering with development teams to review architecture for resiliency, enforce reliability guardrails, and integrate observability and performance standards into the design process. Beyond operational excellence, you will also help develop and integrate AIOps tools for smarter monitoring and automated remediation, ensuring reliability is embedded across the lifecycle. You’ll contribute to incident response for high-severity events and drive automation that reduces complexity, enabling teams to deliver services that meet reliability goals by default.

What You’ll Do
 

  • Define and maintain SLIs/SLOs for critical services; apply error budgets to guide release decisions.
  • Collaborate with development teams to embed resiliency patterns and reliability guardrails into architecture and code.
  • Contribute to incident response for high-severity events; support root cause analysis and post-incident improvements.
  • Establish and evolve observability standards (logging, metrics, tracing) and build actionable dashboards and alerts.
  • Drive performance and scalability improvements through load testing, capacity planning, and CI/CD performance gates.
  • Automate operational tasks using Infrastructure-as-Code (Terraform/Helm), pipelines, and scripts to reduce toil.
  • Advance AIOps capabilities for anomaly detection, smarter alerting, and faster remediation.
  • Partner across teams to provide guidance, reviews, and golden paths for reliability by default.
     

Tech You’ll Use (Day to Day)
 

  • Cloud & Platform: Kubernetes, Docker, SAP BTP, AWS/Azure services.
  • Automation & Development: CI/CD pipelines (GitHub Actions / Azure DevOps), Infrastructure as Code (Terraform/Helm), scripting, and integration into dev workflows.
  • Observability: Logging, metrics, and tracing tools (Dynatrace, Kibana/Elastic, Prometheus, OpenTelemetry).
  • Data & Messaging: Confluent Kafka, SAP HANA.
  • Performance Testing: Load and stress testing tools (e.g., JMeter, k6).
  • Languages: TypeScript, Python, Bash, Java.
     

What You’ll Bring
 

  • 6-10+ years in SRE, DevOps, or production operations for distributed systems.
  • Proven experience with incident response and root cause analysis for high-severity events.
  • Strong skills in observability, performance engineering, and automation.
  • Hands-on expertise in Kubernetes cluster management and troubleshooting.
  • Ability to model load, run stress tests, analyze bottlenecks, and plan capacity.
  • Proficiency in CI/CD and Infrastructure as Code, with the ability to influence development practices.
  • Excellent collaboration and communication skills to partner with development and product teams.
     

Nice to Have
 

  • Familiarity with AIOps concepts (AI‑driven anomaly detection, predictive alerting, automated remediation).
  • Hands-on experience with LLM agent frameworks (e.g., LangGraph or similar) for automation or reliability tooling.
  • Certifications in Kubernetes, SAP BTP, or Dynatrace.
  • Experience with the manufacturing domain.
     

Education & Work Style
 

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
  • Curious, proactive, and data‑driven; comfortable mentoring and promoting best practices.
  • Travel: Occasional (0–10%) for team workshops or cross‑site collaboration.
  • On‑call: Participation in a healthy rotation with continuous improvement focus.

Organization SAP
Industry Engineering Jobs
Occupational Category Reliability Engineering Specialist
Job Location Ontario, Canada
Shift Type Morning
Job Type Full Time
Gender No Preference
Career Level Experienced Professional
Experience 6 Years
Posted at 2025-12-27 4:54 pm
Expires on 2026-02-10