Operation Engineer (closed)

Menlo Park, CA
Competitive compensation
Recruiter Comment: I have a great job opportunity available. It's at Facebook. Know anyone who might be a good fit?
Job Description

The Site Reliability Operations (SRO) team is responsible for ensuring that Facebook is available, reliable, and fast at all times. The SRO team is tasked with determining the source and nature of problems with the Facebook site, API, or infrastructure and either fixing them directly or marshaling internal resources to solve them. SRO provides Facebook with 24/7 oversight and incident response through small teams in Dublin, Ireland and Silicon Valley. SRO's remit is among the broadest in the company, with responsibility for almost every service offered, working closely with cutting-edge technologies on some of the largest deployments in the world. A member of the SRO team must be able to multi-task among several concurrent problems, performing triage and prioritization as necessary, with a strong eye for detail and the ability to work well under pressure. This person is a self-starter with a strong sense of responsibility and problem ownership who can commit to driving issues to completion; someone who can adapt quickly, gluing together working solutions across a broad technology stack and handing them back to engineering teams for a long-term fix. The team also performs software installations, manages configuration changes for the deployed systems, and develops system management scripts and tools.


Responsibilities
  • Maintain the availability and performance of the Facebook site and APIs used by third-party services, as well as the various internal services and systems that these core interfaces depend upon
  • Develop scripts for our automated remediation framework, and build tools to help service owners and fellow SROs maintain the site and core services
  • Learn new technologies and master the Facebook infrastructure so that you can provide 'full stack' diagnostics, when necessary, to help determine the root cause of internal problems
  • Communicate effectively, and describe problems succinctly but with enough detail that you can hand off an ongoing problem to another team or a peer for completion
  • Transfer information accurately and engage positively with other teams; both are vital SRO responsibilities
Requirements
  • Strong troubleshooting skills that range from diagnosing low-level hardware issues to large-scale failures within datacenter clusters
  • In-depth knowledge of Linux and TCP/IP
  • Scripting/programming proficiency: many of our tools are command-line based, and the ability to quickly manipulate text and file input/output is a must. Most tools are developed using Bash and Python, but experience with Ruby, Perl, or PHP combined with an interest in learning Python is sufficient
  • Must possess good written and verbal communication skills
  • Ability to rapidly assess, analyze, and resolve complicated problems with little initial information or direction and with varying degrees of ambiguity
  • This is a shift-based position. Candidates must be able to work 10-hour shifts, 4 days per week