Lead Site Reliability Engineer
Location: Stratford (2 Redman Place)
Time type: Full time
End date: February 17, 2026
Job requisition ID: R032272
Modern tech-stack. Hybrid infrastructure. Reliability for 4,000+ users.
Lead Site Reliability Engineer
£64,000 - £74,000 (+ Benefits)
Grade: P3MP
Reports to: Senior Manager, Platform Engineering
Contract: Permanent
Hours: Full time 35 hours per week
Location: Stratford, London. Office-based with high flexibility (1-2 days per week in the office)
Visa sponsorship: Cancer Research UK can consider visa sponsorship for this vacancy. If this applies to you, please ensure that this is clearly marked on your application.
Closing date: 16 February 2026 23:55
This vacancy may close earlier if a high volume of applications is received or once a suitable candidate is found, so we strongly recommend that you apply early to avoid disappointment. If you require more time to apply as part of a reasonable adjustment, please contact recruitment@cancer.org.uk as soon as possible.
Recruitment process: Telephone interview followed by two competency-based interviews
Interview date: From the week commencing 23 February 2026
How do I apply? We operate an anonymised shortlisting process in our commitment to equality, diversity, and inclusion. CVs are required for all applications, but we won’t be able to view them until we invite you for an interview. Instead, we ask you to fully complete the work history section of the online application form so that we can assess you quickly, fairly, and objectively.
At Cancer Research UK, we exist to beat cancer.
We are professionals with purpose, beating cancer every day. But we need to go much further and much faster. That’s why we’re looking for someone talented, someone who wants to develop their skills, someone like you.
Cancer Research UK has an ambitious Engineering Strategy supported by a modern Tech Stack and a complex hybrid infrastructure spanning on‑premise and multi‑cloud environments.
As a Lead Site Reliability Engineer, you’ll play a vital role in shaping and advancing SRE practices across the charity. You’ll lead incident response, drive automation to reduce operational toil, and act as the escalation point for complex production issues. You’ll define meaningful Service Level Objectives, strengthen observability, and help foster a blameless, learning‑focused culture that continually improves reliability.
You’ll also lead and develop a team of Site Reliability Engineers, balancing day‑to‑day operational needs with engineering work that delivers long‑term improvements. Working closely with development teams and Platform Engineering colleagues, you’ll embed SRE principles across our services, coaching engineers and influencing technical direction to ensure reliability is built in from the start.
If you’re an SRE leader who has strengthened large‑scale production systems across complex on‑premise and AWS environments, and you’re passionate about developing and leading teams to drive meaningful change, we would love for you to join our mission.
What will I be doing?
- Ensuring the reliability, availability, and performance of Cancer Research UK’s production services across AWS, on‑premise, and data centre environments. This includes:
  - Defining and monitoring Service Level Objectives (SLOs), error budgets, and reliability metrics.
  - Reducing incidents and operational toil through automation, engineering improvements, and continuous optimisation.
- Leading incident response, promoting a blameless culture, coordinating cross‑team response, and ensuring post-mortems and follow‑up actions drive long‑term improvement.
- Building and maintaining comprehensive monitoring, logging, alerting, and tracing capabilities.
- Creating tools and dashboards that give teams clear visibility into system health, performance, and reliability, and help them proactively identify issues.
- Collaborating closely with development teams, architects, and Platform Engineering colleagues to embed reliability, observability, and operability into service design.
- Advising on scalability, performance, capacity planning, and production readiness at scale.
- Driving automation and toil reduction through infrastructure as code, robust CI/CD pipelines, self‑service tooling, and the removal of manual operational tasks.
- Collaborating with the Head of Platform Engineering and peers to shape SRE strategy and practices across the organisation.
- Championing the adoption of SRE principles (including SLOs, error budgets, capacity planning, and the balance between reliability work and feature development).
- Using modern platform approaches (IaaS, PaaS, FaaS, containers, serverless) to balance reliability, agility, and cost‑effectiveness.
- Producing and maintaining high-quality documentation so that production systems can be understood, debugged, and operated by the team, and promoting knowledge sharing.
- Defining and championing best practices for reliability, observability, incident management, and operational excellence across the organisation.
Line Management:
- Line-managing and leading the SRE team (c.5 direct reports), coaching them to develop their skills and careers.
- Creating an inclusive and high-performing team culture that recognises success and retains talent within the team and wider function.
- Setting clear objectives and KPIs for the team.
- Balancing operational demands with engineering work to ensure the team can invest in automation, reliability improvements, and skills development.
- Mentoring engineers across Platform Engineering and development teams to strengthen operational capability and adopt SRE best practices.
- Supporting self‑service initiatives while ensuring strong governance around reliability, security, and cost management.
What skills will I need?
- Proven experience as a Lead Site Reliability Engineer, operating and improving large‑scale production systems across complex on‑premise and AWS cloud environments.
  - This includes troubleshooting performance issues, managing incidents, conducting post-mortems, and implementing lasting solutions that prevent recurrence.
- Expertise in SRE best practices, with strong AWS experience and a proven record of improving reliability and reducing toil through engineering solutions across networking, storage, databases, and platform services.
- Experience automating operational tasks and delivering self‑service capabilities using infrastructure as code and CI/CD tooling (e.g., Terraform, AWS CDK, Ansible, CloudFormation, GitHub Actions, GitLab CI).
- Experience troubleshooting and debugging Linux/Unix systems, using Python, in line with security best practices.
- Strong observability experience (including Prometheus, Grafana, ELK/Splunk, Datadog, or CloudWatch), with the ability to design effective monitoring, alerting, and dashboards.
- Proficiency with containerisation and orchestration (Docker, Kubernetes, ECS/Fargate) and a solid understanding of microservices, distributed systems, and service mesh technologies.
- Background in leading engineering teams, with strong management and coaching skills and the ability to drive change and guide people through ambiguity and evolving business needs.
- A track record of building credible, collaborative relationships with technical and non-technical stakeholders, with the ability to explain complex technical issues, balance competing priorities, and influence technical decisions.
Our organisation values are designed to guide all that we do.
Bold: Act with ambition, courage and determination
Credible: Act with rigour and professionalism
Human: Act to have a positive impact on people
Together: Act inclusively and collaboratively
We’re looking for people who can believe in and embody these organisation values and can use them to drive forward progress against our mission to beat cancer.
If you’re interested in applying and excited about working with us, but are unsure if you have the right skills and experience, we’d still love to hear from you.
What will I gain?
We create a working environment that supports your wellbeing and provide a generous benefits package, a wide range of career and personal development opportunities and high-quality tools. Our policies and processes enable you to improve your work-life balance, take positive steps in your career and achieve your personal wellbeing goals.
You can explore our benefits by visiting our careers web page.
Additional Information
For more information about working with us please visit our website or contact us at recruitment@cancer.org.uk.
For more updates on our work and careers, follow us on: LinkedIn, Facebook, Instagram, X and YouTube.
Our vision is to create a charity where everyone feels like they belong, benefits from, and participates in the work we do. We actively encourage applications from people of all backgrounds and cultures, in particular those from ethnic minority backgrounds who are currently under-represented.
We want to see every candidate performing at their best throughout the job application process, interview process and whilst at work. We therefore ask you to inform us of any concerns you have or any adjustments you might need to enable this to happen. Please contact recruitment@cancer.org.uk or 020 3469 8400 as soon as possible.
Unfortunately, we are unable to recruit anyone below the age of 18, so that we can protect young people from health & safety and safeguarding risks.