Site Reliability Engineer

Lawrence Berkeley National Laboratory
Berkeley, California 94720
  • Job Status: Full Time

Site Reliability Engineer - 90387

Organization: NE-NERSC


We are the National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory. NERSC is the primary scientific computational facility for the Office of Science in the US Department of Energy (DOE). We accelerate scientific discovery with our world-class computational systems, data analytics and high-speed network.


We are looking to hire individuals to join our team of Site Reliability Engineers to ensure that NERSC is accessible, reliable, secure and available to our users on a 24x7 basis.


What You Will Do:

• Apply working knowledge of clustered Linux systems to manage the reliability of the NERSC facility in three areas: computation, data storage, and the facility environment, to enable the continuous scientific progress of the users.

• Apply demonstrated skills as a Linux Systems Administrator and a site reliability engineer.

• Solve problems of diverse scope related to keeping critical services functioning, and create alerting, notification and problem-solving programs that prevent problem recurrence, with the goal of automating the response to all routine service conditions.

• Under the guidelines of the group’s project manager, collaborate with others to develop and maintain diagnostic tools that support the HPC community within NERSC, using programming languages such as C, C++, Python, Java or Perl and standard software development practices.

• Using knowledge of an incident management platform, develop tools that automate tasks, alerts and notifications within the platform.

• Using knowledge of the Facility Operations processes, provide input in the design of software tools, workflows and new processes that continuously improve the diagnostic capabilities of the group to ensure the availability of the HPC services provided by NERSC.

• Collaborate with other groups in the testing and implementation of new diagnostic tools, workflows and new capabilities for providing high availability for the systems in production. Write the documentation necessary for these new tools and provide training for staff in their use.

• Provide accurate information in the trouble ticketing system for outages, maintenance and other incidents such that the workflow and protocols can be appropriately tracked by others.

• Work closely with other NERSC groups to manage maintenance, perform tasks such as upgrades and batch queue shutdowns, manage diagnostic and notification software, and more generally manage a center-wide outage.

• In cooperation with Lab Fire and Safety, play a primary role in investigating, assessing and confirming fire-related alarms to protect and minimize damage to the assets of the HPC floor.

• Conduct periodic on-call duties as necessary to support a 24x7 workflow.


The posting shall remain open until the position is filled; however, for full consideration, please apply by June 30, 2020.



• This is a 2 year, full-time, term appointment with the possibility of extension or conversion to Career appointment based upon satisfactory job performance, continuing availability of funds and ongoing operational needs.

• This position supports a 24x7 operation. The work schedule may include day, swing, night shift or weekend hours depending on operational needs, as well as a 24x7 on-call rotation. The current opening is for night shift and weekend hours.

• This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.

• This position requires access to export control and security sensitive information. Therefore the selected incumbent for this position requires U.S. citizenship or U.S. permanent residency.

• Work will be primarily performed at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA.


How To Apply

Apply directly online and follow the online instructions to complete the application process.


Learn About Us:

Working at Berkeley Lab has many rewards, including a competitive compensation program, excellent health and welfare programs, a retirement program that is second to none, and outstanding development opportunities. To view information about the many rewards offered at Berkeley Lab, visit https://hr.lbl.gov/.


Berkeley Lab (LBNL, http://www.lbl.gov/) addresses the world’s most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab’s scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the U.S. Department of Energy’s Office of Science.


Equal Employment Opportunity: Berkeley Lab is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status. Berkeley Lab is in compliance with the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4 (https://www.dol.gov/ofccp/PayTransparencyNondiscrimination.html). The poster "Equal Employment Opportunity is the Law" is available at https://www.dol.gov/ofccp/regs/compliance/posters/ofccpost.htm.


What is Required:

• Bachelor’s Degree in Computer Science or similar discipline and a total of 5 years of experience; or an equivalent combination of education and experience.

• Three years as a Linux (or similar type of operating system) system administrator or system engineer in a customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continued availability to the user community. This can include assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents and working with vendors on hardware warranty replacements.

• Hands-on experience in Red Hat Enterprise Linux or another Linux variant in a shell or command line environment.

• Minimum of 3 years of experience in a UNIX or Linux environment, with networking, IT infrastructure or cluster management experience in a distributed computing environment.

• Hands-on experience with developing tools using various programming languages such as C, C++, Perl, Java, and Python or a scripting language with knowledge of standard software development practices. Feel free to share your GitHub repository for us to look at.

• Networking experience with network theory such as TCP/IP, UDP, ICMP (networking protocols in general), MAC addresses, IP packets, DNS, OSI layers, and load balancing.

• An understanding of different monitoring implementations and of the system administration those solutions require.

• Past experience with Incident Management and a proficient understanding of IT service management.

• Exposure to Oracle or other high-end Storage Infrastructure.

• Background configuring distributed, server-based or cluster-based infrastructure supporting a high volume of transactions in a Linux environment. An understanding of VMs and containers, how to manage them, and an understanding of IoT technologies.

• Ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications.

• Motivated self-starter who can learn emerging technologies that improve data center management in areas like Jupyter, Kibana, Functions as a Service, Kubernetes, building management software, evaporative cooling, an incident management platform and power utilization.


Additional Desired Qualifications:

• Experience with network security: configuring/maintaining ACLs, knowledge of firewalls.

• Network programming or a network certification.

• A certification in a system administration area.

• Ability to provide input toward creating new standards and methods for managing large scale distributed systems.


Posted: 2020-06-03 Expires: 2020-08-02
