Sr Specialist, Site Reliability Engineer

Description:

This role will utilize technical knowledge and analytical skills in architecting and optimizing cloud infrastructure, standardizing technology stacks for cloud and data centers, implementing cloud service catalogs and service governance solutions, and driving the implementation of a cloud-first strategy and cloud adoption in partnership with development and business stakeholders.

As a Site Reliability Engineer, this position will be part of a talented platform engineering team that demonstrates superb technical competency, delivering mission-critical infrastructure and ensuring the highest levels of availability, performance, and security. The role will be responsible for supporting and operating large-scale desktop and web application systems.

The position holds direct accountability for Technical Operations objectives and responsibility for team goals that drive the performance and growth of infrastructure and applications. The role will address critical and high-severity problems, where analyzing an issue or application functionality requires reviewing dependencies across systems, infrastructure, network, application, database, data, and protocols.

Providing technical operations and support, including building and managing cloud infrastructure for web/API applications hosted on Windows/Linux operating systems across multiple data centers.
Ensuring the availability, consistency, stability, and performance of applications.
Participating in critical incident management calls with hands-on ability to troubleshoot IIS/.NET/Apache application, infrastructure, and network performance issues.
Providing 3rd-level support for escalations and communicating progress updates to stakeholders on incident resolutions.
Coordinating with Development and Tech teams on product rollouts, releases, and critical fixes, and working with Performance and QA teams to review committed performance metrics.
Coordinating with Tech teams on OS upgrades and vulnerability remediation to meet audit and compliance requirements.
Applying extensive experience with observability tools, specifically hands-on, administrator-level expertise in building and managing Grafana/Prometheus/Loki infrastructure, including managing large volumes of logs with Loki.

What We’re Looking For

Basic Required Qualifications:

Bachelor’s/Master’s Degree in Computer Science, Information Systems, or equivalent.
2 to 3 years of Site Reliability Engineering (SRE) experience.
Scripting skills in any of the following: Shell scripts, Python, Perl, PowerShell, etc.
Experience in building cloud infrastructure as code (IaC) using Terraform, CloudFormation templates (CFTs), etc.
Ability to perform Apache/Tomcat + J2EE installations on Linux-based systems.
Working knowledge of AWS cloud technologies (VPC, EC2, EKS, ELB, RDS, Lambda, SES, SNS, API Gateway, etc.) and of containers (Docker, Kubernetes).
Knowledge and admin experience with observability tools such as ELK, Grafana, Splunk.
CI/CD delivery using source control, build, and configuration management tools such as GitHub, VSTS, Ansible, Puppet, Chef, Salt, Jenkins, Maven, etc.
Knowledge of load balancing and high-availability planning with BIG-IP, Application Load Balancers, and NLB.

Organization	S&P Global
Industry	Engineering Jobs
Occupational Category	Site Reliability Engineer
Job Location	British Columbia, Canada
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2025-10-11 8:47 am
Expires on	2026-07-21