Job Summary
The Site Reliability Engineer intern will support the application of software engineering principles to IT operations to ensure the company’s platforms are reliable, scalable, observable, and efficient. The role focuses on automation, monitoring, incident management, infrastructure as code, and measurable reliability targets (SLIs/SLOs) to guarantee high availability and performance across all products.
Duties and Responsibilities
- Assist in designing, implementing, and continuously improving system reliability, availability, and performance by helping define and monitor SLIs, SLOs, and error budgets across all assigned platforms.
- Support building and managing a robust monitoring and observability framework using Prometheus, Grafana, and Loki to track latency, traffic, errors, system health, and user impact.
- Assist in automating infrastructure provisioning, scaling, and configuration management using Infrastructure as Code principles with Terraform and Kubernetes to ensure consistency, scalability, and disaster recovery readiness.
- Participate in incident response processes, including detection, escalation, resolution, communication, and blameless postmortems to prevent recurrence.
- Assist in reducing manual operational workload through automation, scripting, and process optimization to improve efficiency and release velocity.
- Support efforts to ensure high availability and performance of business-critical systems.
- Collaborate with Engineering, Product, and DevOps teams to help improve deployment safety, capacity planning, cost optimization, and system scalability.
- Assist in establishing alerting strategies and reliability standards that minimize alert fatigue while ensuring rapid detection and resolution of production issues.
Required Knowledge, Qualifications, and Experience
- Bachelor’s Degree in Computer Science, Information Technology, or a related field.
- Some exposure to Kubernetes and cloud networking.
- Some experience with monitoring and observability tools.
- Good exposure to managing production systems in cloud environments.
- Some exposure to implementing and managing CI/CD pipelines using tools such as Jenkins, GitLab CI/CD, or equivalent.
- Some exposure to cloud platforms (AWS, Azure, Google Cloud) and containerization tools such as Docker and Kubernetes.
- Basic hands-on exposure to monitoring and metrics systems such as Prometheus.
- Basic familiarity with dashboarding and visualization tools such as Grafana.
- Foundational understanding of log aggregation systems such as Loki.
- Familiarity with Linux environments and basic system commands.
- Exposure to scripting concepts using Python, Bash, or similar languages.
- Foundational knowledge of Artificial Intelligence (AI) and good exposure to AI agents; relevant certifications in AI or related disciplines will be an added advantage.
How to Apply
Send your resume and portfolio with the subject line SITE RELIABILITY ENGINEER INTERN to recruiting@interintel.co.ke
Submission deadline: 9th March 2026
