March 25, 2022 / Cafeto

Site reliability engineering (SRE) is an engineering discipline that focuses on helping organizations reach a suitable reliability level in their systems, services, and products. A site reliability engineer creates dependable, ultra-scalable software systems by applying software engineering to infrastructure and operational issues (Microsoft, n.d.).

SRE’s primary goals are to ensure that systems are implemented correctly and that users receive a reliable service. It covers managing the software problems that arise after deployment, as well as the system’s error budget (the amount of unreliability a service can tolerate while still meeting its reliability target). SRE concentrates on creating reliable systems and guaranteeing efficiency (Hall, 2021).

Discover what a site reliability engineer can contribute to a company and how the role differs from DevOps. Also, learn more about the tools these engineers use.

What can a site reliability engineer contribute to a company?

A site reliability engineer can design solutions that foster balance between the operations and development teams. Usually, they help set a Service Level Agreement (SLA), which defines how reliable a system must be for its end users. Compliance with an SLA is tracked through Service Level Indicators (SLIs), which are the measured values, and Service Level Objectives (SLOs), which are the targets those values should meet. Moreover, they serve as communication leaders who keep the SRE workflow developing and evolving smoothly (NetApp, n.d.; Lanas, 2019).
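As a rough illustration of how an SLI is compared against an SLO, and how the error budget follows from that target, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests); the names `SloReport` and `evaluate_slo` and the sample numbers are hypothetical and do not come from any of the cited sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of successful requests
    slo_met: bool            # True when the SLI reaches the target
    error_budget: float      # failed requests the SLO tolerates in this window
    budget_remaining: float  # negative when the budget is overspent


def evaluate_slo(total_requests: int, failed_requests: int, slo: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")

    # SLI: the share of requests that succeeded during the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the share of requests allowed to fail under the SLO.
    error_budget = (1 - slo) * total_requests
    return SloReport(
        sli=sli,
        slo_met=sli >= slo,
        error_budget=error_budget,
        budget_remaining=error_budget - failed_requests,
    )


if __name__ == "__main__":
    # Hypothetical window: one million requests, 400 failures, 99.9% target.
    report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo=0.999)
    print(f"SLI: {report.sli:.4%}, SLO met: {report.slo_met}")
    print(f"Error budget remaining: {report.budget_remaining:.0f} requests")
```

In practice, a team might slow down risky releases when the remaining budget approaches zero and ship more freely when plenty of budget is left.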

Essentially, these engineers dedicate their time to automating and improving software reliability. They aim to reduce manual work so that both infrastructure and software are developed efficiently. Consequently, they help support software that is running in production. Some of the responsibilities of a site reliability engineer inside a company are (Hall, 2021; Treynor, n.d.):

  • Deciding how code is deployed, configured, and monitored.
  • Planning service capacity.
  • Tracking software health, availability, and latency.
  • Managing arising changes and responding to emergencies.
  • Developing and configuring tools that support DevOps practices and culture.
  • Measuring performance and efficiency, and monitoring operations.
  • Providing on-call services to ensure an expert is always available.

The difference between SRE and DevOps

DevOps is a combination of development and operations. It brings together people, processes, and technology to continually deliver value to customers. Doing so enables previously isolated roles (like development, IT operations, quality engineering, and security) to work together to produce better, more reliable products (Microsoft, n.d.).

While SRE and DevOps are both operational practices, they do diverge. Let’s take a closer look at site reliability engineers vs. DevOps and their differences (Microsoft, n.d.; Lanas, 2019):

  • SRE is an engineering discipline that focuses on reliability. On the other hand, DevOps is a practice that arose from the need to connect functions that used to work in isolation inside organizations.
  • Moreover, SRE provides an established role: site reliability engineer. In contrast, DevOps is part of an organization’s corporate culture.
  • Usually, SRE is more prescriptive. DevOps, intentionally, is less so. Instead, it adopts integration principles.
  • DevOps focuses on enabling developers to build and manage services. After deployment, SRE monitors applications or services and uses automation to improve a system’s overall state and availability.
  • SRE is a more scalable approach for the continuous development and improvement of complex systems. Meanwhile, DevOps is ideal for frequent product and code releases.

The best SRE tools used in companies

A site reliability engineer’s software tools and solutions may vary from organization to organization. For instance, a large company typically needs more people on its SRE team, usually with segmented responsibilities. In contrast, a small company tends to have limited staffing, so the chosen SRE tool should be all-inclusive (Hall, 2021).

Some popular solutions a site reliability engineer can use today include (Hall, 2021):

  • NetApp for storage environments.
  • PagerDuty for incident response.
  • Python as a programming language.
  • Docker for containerization and microservices.
  • Terraform for infrastructure as code and configuration management.
  • Prometheus and Kibana for monitoring and analysis.

Google SRE

Google is the SRE pioneer. Its engineering vice president, Ben Treynor, decided to have software engineers carry out operations tasks. With that, the Google SRE position was born, and with it, the role of the site reliability engineer. But what does this innovative solution consist of? The idea is to designate an IT workforce responsible for the stability of the production environment and the improvement of performance (Lanas, 2019).

Google’s approach suggests that SRE teams hire software engineers to run products and to create systems that perform the tasks system administrators often do manually. Instead of having an operations team made up only of these administrators, R&D-minded and experienced software engineers can switch their goals and help to automate solutions (Treynor, n.d.).

As a result, Google’s SRE teams are made up of engineers who meet Google’s standard software engineering bar (50-60%) and engineers who come very close to that bar (40-50%). The latter bring technical skills that are valuable for SRE, such as expertise in UNIX system internals and networking. Google caps operational work at 50% of an SRE’s time, leaving the rest for development work (Treynor, n.d.).
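The 50% limit on operational work comes down to simple arithmetic. The following minimal sketch assumes that a team logs the hours it spends on operations and on development over some period; the function name `ops_load` and the sample hours are hypothetical and only serve as an illustration.

```python
def ops_load(ops_hours: float, dev_hours: float, cap: float = 0.5):
    """Return (fraction of time spent on ops, whether that fraction exceeds the cap)."""
    if ops_hours < 0 or dev_hours < 0:
        raise ValueError("hours must not be negative")
    total = ops_hours + dev_hours
    if total == 0:
        raise ValueError("at least some hours must be logged")

    fraction = ops_hours / total
    # The cap is an upper limit: spending exactly 50% on ops still complies.
    return fraction, fraction > cap


if __name__ == "__main__":
    # Hypothetical week: 26 hours of operational work, 14 hours of development.
    fraction, over_cap = ops_load(ops_hours=26, dev_hours=14)
    print(f"Ops share: {fraction:.0%}, over the cap: {over_cap}")
```

When the check reports that the cap is exceeded, that is a signal to redirect some operational work or to invest in automating it.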

How much does a site reliability engineer cost in Latin America and the United States?

The average wage for a site reliability engineer in Latin America and the United States heavily depends on each company and its expectations. The type of role and contract they offer also influences salary, as many organizations choose to outsource to optimize costs.

A site reliability engineer is critical to improving system reliability and keeping the balance between operations and development. They make the most of technology and automation to add value to applications and development teams by removing manual tasks. Google originated the role to change production management by using reliability teams that close the gap between operations and software developers.