Site Reliability Engineering
- Jul 31, 2023
By Adrian Gonciarz, QA / SRE Director
The origin of Site Reliability Engineering
Site Reliability Engineering (SRE) has its roots at Google, where Ben Treynor Sloss started the practice in 2003 after being tasked with improving the performance, availability, and stability of the company's systems.
Having a programmer’s background, Ben approached the task the way a software engineer would, applying methods common in code development rather than in operations. At the time, DevOps culture was still young and the traditional division of work applied: developers handled only programming tasks, while deployment and maintenance were handled by a separate team of administrators. He assigned half of his team members' time to operational work; in the remaining time, they concentrated on improving the codebase with tools that made the software easier to monitor in the production environment. Site Reliability Engineering quickly became a standard approach across the whole organization and, years later, is one of the most important branches of the IT industry.
If we were to summarize the aim of SRE in modern organizations in one sentence, it would be to provide all the means necessary to achieve the required availability of the system (for example, being available 99.99% of the time).
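To make an availability target such as 99.99% more tangible, it helps to translate it into allowed downtime. The sketch below is only an illustration: the function name and the 30-day window are assumptions, not something taken from a specific tool. Over 30 days, 99.99% leaves roughly 4.3 minutes of downtime.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return how long a system may be down within `period`
    while still meeting the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


if __name__ == "__main__":
    # The 30-day window is an assumption chosen for illustration.
    window = timedelta(days=30)
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime(target, window).total_seconds() / 60
        print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why higher targets demand much more automation and faster detection.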
Who is a Site Reliability Engineer and what do they do?
A Site Reliability Engineer is someone who comes from a very wide background covering system administration, application development, software testing, and business analysis. Their main goal is to work closely with the development and operations (DevOps) teams to improve the resiliency, observability, and overall reliability of the system using programming methods.
Site Reliability Engineers analyze different layers of the system: from the infrastructure of underlying machines and databases where the environment is running, through deployment orchestration (Kubernetes) and applications’ memory and CPU consumption, to the highest layers of application functions such as HTTP requests, their latency, and errors.
They also use other sources of data such as logs, metrics, and error reporting tools: pretty much everything that gives them meaningful insight into the system’s health and performance. A common starting point for what to measure is the Four Golden Signals of SRE: latency, traffic, errors, and saturation.
There are also more sophisticated statistical tools that turn the gathered data into a measurable formula for checking specific parts of the system against potential outages, namely Service Level Indicators (SLIs) and Service Level Objectives (SLOs). An SLI is a measured value, such as the share of successful requests, and an SLO is the target that value should meet. Using these, the engineers can continuously monitor the status of particular endpoints or a whole system with easy-to-read green/yellow/red statuses.
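As a rough illustration of how an SLI, an SLO, and a green/yellow/red status relate, consider the minimal sketch below. It assumes a request-based SLI (good requests divided by all requests) and a warning threshold set slightly above the target; the class, the thresholds, and the `checkout` endpoint name are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float   # e.g. 0.995 means 99.5% of requests must be good
    warning: float  # assumed early-warning threshold, set above the target


def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def status(slo: Slo, good_events: int, total_events: int) -> str:
    """Map the measured SLI onto a green/yellow/red status."""
    value = sli(good_events, total_events)
    if value >= slo.warning:
        return "green"
    if value >= slo.target:
        return "yellow"
    return "red"


if __name__ == "__main__":
    checkout = Slo(name="checkout", target=0.995, warning=0.998)  # hypothetical endpoint
    for good in (9990, 9960, 9900):
        print(checkout.name, good, status(checkout, good, 10_000))
```

The yellow band gives a team time to react before the objective is actually breached.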
One of the key assumptions of SRE is “zero toil”. In practice, it means that we, the engineers, want to eliminate the manual aspects of the work, which can be repetitive, boring, and prone to human error. Everything that can be automated should be automated. In a complex system with a huge number of components that can automatically scale up, it wouldn’t be possible to implement the necessary mechanisms properly by hand.
How is SRE implemented at Kitopi?
Even though our SRE team consists of only a few engineers, we make sure that the principles of SRE are applied consistently across all components of the system. At the center of our activities lies giving ownership and power to the relevant teams. In our case, each development team is responsible for a part of Kitopi’s architecture, so it is critical that every part is properly monitored and that any potential degradation is quickly picked up and alerted on.
We use Dynatrace as the tool for most of our activities. We divided the components of the system into so-called Management Zones, with each zone belonging to the relevant team. This way, the components owned by a team are monitored separately and alerts go to the proper Slack channel. We rely heavily on automatically detected anomalies (such as an increased failure rate or slower response times), but for tracking applications' health we also use SLOs for the most important endpoints of the system. We had a lot of problems with false alarms triggered by oversensitive anomaly detection settings, so we cooperated closely with development teams to make sure only meaningful problems were picked up and alerted on. The time needed to react to a degradation went down significantly thanks to building a culture of reliability ownership in teams.
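The idea of routing each alert to the channel of the team that owns the affected zone can be sketched as a simple lookup. This is not Dynatrace's API or configuration format; the zone names, channel names, and the `Problem` type are all hypothetical and only illustrate the ownership mapping.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a management zone to the owning team's Slack channel.
ZONE_TO_CHANNEL = {
    "orders": "#team-orders-alerts",
    "payments": "#team-payments-alerts",
}
# Assumed fallback so that problems in unowned zones are never dropped.
FALLBACK_CHANNEL = "#sre-alerts"


@dataclass(frozen=True)
class Problem:
    title: str
    zone: Optional[str]  # management zone of the affected component, if known


def channel_for(problem: Problem) -> str:
    """Pick the Slack channel of the team that owns the problem's zone."""
    return ZONE_TO_CHANNEL.get(problem.zone, FALLBACK_CHANNEL)


if __name__ == "__main__":
    for problem in (
        Problem("Increased failure rate", "orders"),
        Problem("Slow response time", None),
    ):
        print(f"{channel_for(problem)}: {problem.title}")
```

Keeping a fallback channel means a missing mapping produces a visible alert for the SRE team instead of silence.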
Recently, we’ve been using a new feature of Dynatrace, the Grail Engine, which allows us to use events and logs as sources of analytical information. In other words, we can get meaningful observability input by running computations on the logs and events collected when users perform certain actions in the system. This gives us a huge advantage when observing trends in quickly fluctuating data.
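To give a feel for what "running computations on events" can mean, here is a small sketch in plain Python that groups events into one-minute buckets and computes a failure rate per bucket. It does not use Dynatrace or its query language; the event shape (a timestamp plus a success flag) is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Event = Tuple[datetime, bool]  # (timestamp, succeeded); an assumed event shape


def failure_rate_per_minute(events: Iterable[Event]) -> Dict[datetime, float]:
    """Group events into one-minute buckets and return the failure rate of each."""
    totals: Dict[datetime, int] = defaultdict(int)
    failures: Dict[datetime, int] = defaultdict(int)
    for timestamp, succeeded in events:
        bucket = timestamp.replace(second=0, microsecond=0)
        totals[bucket] += 1
        if not succeeded:
            failures[bucket] += 1
    return {bucket: failures[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    sample = [
        (datetime(2023, 7, 31, 12, 0, 5), True),
        (datetime(2023, 7, 31, 12, 0, 40), False),
        (datetime(2023, 7, 31, 12, 1, 10), True),
    ]
    for minute, rate in failure_rate_per_minute(sample).items():
        print(minute.isoformat(), f"{rate:.0%}")
```

Bucketing by time is what turns a noisy stream of individual events into a trend that can be charted or alerted on.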
Summary
Site Reliability Engineering is still a young, dynamically growing discipline of the IT industry. It focuses mainly on improving the reliability of systems through better observability, solid alerting, and tools such as SLIs and SLOs. An important part of a Site Reliability Engineer's job is automating tasks and learning from past outages. They cooperate closely with developers and DevOps engineers.
At Kitopi we’re proud to have grown a mature culture of SRE in teams that take ownership of the reliability of their applications. And we’re not stopping there!


