Job Description
Responsibilities
- Own operational health dashboards, alert thresholds, and incident response playbooks for the cloud platform
- Lead on‑call rotations, coordinate major incident resolution, and drive post‑incident reviews
- Implement and maintain Disaster Recovery (DR) solutions for core applications, including DNS routing strategies and low‑RTO repositories
- Manage patching pipelines, golden images, container registries, backups, and automated resilience testing
- Partner with platform engineers to feed operational learnings into architecture improvements and the roadmap
- Use automation and AI‑assisted tools to correlate anomalies, reduce noise, and accelerate root‑cause discovery
- Educate product teams on DR patterns, operational best practices, and shared responsibilities
Requirements
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or equivalent professional experienc...
Ready to Apply?
Submit your application today and join our talented team at EPAM Systems.
Job Details
- Location: Moreno, Buenos Aires
- Job Type: Full-time
- Category: IT and Technology
- Posted Date: June 08, 2026
- Application Deadline: July 18, 2026