Senior Engineering Manager – Critical Operations and Reliability Engineering

Netflix

Role Overview

C.O.R.E is the central SRE team within Infrastructure Engineering that defines and drives reliability practices for all consumer-facing app development teams. The C.O.R.E team’s mission is to improve the availability and reliability of Netflix’s infrastructure while enhancing the operational readiness of its engineering culture, focusing on incident management and operational excellence.

As the Senior Manager of the C.O.R.E Site Reliability Engineering (SRE) team, you will lead the integration of Netflix’s SRE model with industry-leading best practices. You will define and drive reliability practices for all consumer-facing product teams, ensuring that our services are reliable, scalable, and efficient. This role is pivotal to the reliability and performance of Netflix’s services, driving innovation and optimizing system operations to support the company’s mission of revolutionizing entertainment.

Role Responsibilities

  • Strategic Leadership: You will lead and mentor the C.O.R.E SRE team while setting the strategic vision and technical direction for world-class system reliability, observability, and scalability.
  • Reliability: Your leadership will enable consumer-facing product teams to adopt standardized strategies for meeting reliability targets (e.g., SLOs/SLIs and error budgets).
  • Incident Management: You will manage high-severity incidents impacting Member Experience and/or Revenue across SVOD, Live, Ads, and Games, conduct post-incident reviews, and provide ongoing incident trend analysis to prevent recurrence and improve system architecture.
  • Operational Excellence: You will drive down the operational cost of service ownership by optimizing system reliability and scalability via resilience experiments.
  • Automation and Tools: You will use automation and tooling to simplify deployment, monitoring, alerting, incident response, and resolution.
  • Collaboration and Integration: You’ll work closely with SREs, dev teams, and service owners to integrate reliability practices into the SDLC and manage shared accountability for service health.

Requirements

  • Proven experience in a senior leadership role within Site Reliability Engineering or a related domain.
  • Substantial experience commanding high-pressure and large-scale incidents.
  • Openness to participating in an on-call rotation with 24/7 coverage.
  • Extensive experience with high-scale cloud platforms and a strong understanding of distributed systems, networking, and software engineering.
  • Experience working in a collaborative environment, influencing stakeholders across various levels of the organization. Ability to build strong relationships with engineering, product, and business teams.
  • Strong problem-solving abilities and a proactive approach to challenges.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent work experience.

Our compensation structure consists solely of an annual salary; we do not have bonuses. You choose each year how much of your compensation you want in salary versus stock options. To determine your personal top-of-market compensation, we rely on market indicators and consider your specific job family, background, skills, and experience to place you within the market range. The range for this role is $480,000 – $1,200,000.
