- Ensure the high availability, reliability, and performance of the Azure-based AI cloud platform
- Lead proactive monitoring, outage detection, and incident response to minimize downtime and operational risk
- Design and maintain disaster recovery and business continuity processes to safeguard critical AI workloads
- Oversee cybersecurity operations, including vulnerability management, audits, and compliance with security standards for the AI platform
- Collaborate closely with MLOps, LLMOps, and engineering teams to integrate automation, observability, and security best practices into platform operations
Skills and experience required
- Bachelor’s degree in Computer Science or equivalent
- Minimum 5 years of experience in cloud administration and/or operations
- Deep expertise in Azure operations and monitoring services, including Azure Monitor, Log Analytics, and Application Insights
- Strong background in incident management, SRE practices, and disaster recovery design
- Hands-on experience with cloud security operations: IAM, SIEM/SOAR, vulnerability management, firewalls, endpoint detection
- Proficiency in infrastructure-as-code (Terraform, Bicep, ARM) and automation scripting (PowerShell, Python)
- Familiarity with AI/ML infrastructure (AKS, GPU VMs, data pipelines, model hosting) and its operational demands
To apply online, please use the apply function. Alternatively, you may contact Chloe Chen at chloe.chen(@)randstad.com.sg. (EA: 94C3609 /R1768253)