Site Reliability Engineering
- malaikarehman6
- Jul 31, 2023
By Adrian Gonciarz, QA / SRE Director
The origin of Site Reliability Engineering
Site Reliability Engineering (SRE) has its roots at Google, where the idea was started in 2003 by Ben Treynor Sloss, who was tasked with improving the systems’ performance, availability, and stability.
Having a programmer’s background, Ben approached the task as a software engineer would and applied methods common in code development rather than in operations. This was a time when DevOps culture was still young and the traditional division of work held: developers handled only programming tasks, while deployment and maintenance were handled by a separate team of administrators. He assigned half of his team members’ time to operational work, while in the remaining time they concentrated on improving the codebase with tools that would make the software easier to monitor in the production environment. Site Reliability Engineering quickly became a standard approach across the whole organization and, years later, is one of the most important branches of the IT industry.
If we were to summarize in one sentence the aim of SRE in modern organizations, it would be to provide all necessary means to achieve the required availability (for example 99.99% of the time) of the system.
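To make a target such as 99.99% more tangible, it helps to translate it into the amount of downtime it allows. The snippet below is a minimal sketch of that arithmetic; the function name and the default 365-day period are assumptions made for illustration, not something prescribed by SRE practice.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for a target availability.

    availability_percent: target such as 99.99 (percent of time the system is up).
    period_days: length of the measurement window; 365 days is an assumed default.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The fraction of time the system may be unavailable is 1 - availability.
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.99% leaves roughly 52.6 minutes of downtime per year, which is why each additional "nine" is considerably harder to achieve than the previous one.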
Who is a Site Reliability Engineer and what do they do?
A Site Reliability Engineer is someone who comes from a very wide background covering system administration, application development, software testing, and business analysis. Their main goal is to work closely with the development and operational (also known as DevOps) teams to improve the resiliency, observability, and overall reliability of the system using programming methods.
Site Reliability Engineers analyze different layers of the system: from the infrastructure of underlying machines and databases where the environment is running, through deployment orchestration (Kubernetes) and the applications’ memory and CPU consumption, to the highest layers of application functions such as HTTP requests, their latency, and errors.
They also utilize other sources of data such as logs, metrics, and error reporting tools: pretty much everything that gives them meaningful insight into the system’s health and performance. A common starting point for deciding what to measure is the Four Golden Signals of SRE: latency, traffic, errors, and saturation.
There are also more formal tools that turn the gathered data into a measurable formula for checking specific parts of the system against potential outages, namely Service Level Indicators (SLIs) and Service Level Objectives (SLOs). An SLI is a measured value, such as the share of successful requests, and an SLO is the target that value is expected to meet. Using these, the engineers can continuously monitor the status of particular endpoints or a whole system with easy-to-read green/yellow/red statuses.
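As a rough illustration of how an SLI can be compared with an SLO to produce a green/yellow/red status, consider the sketch below. The `Slo` class, the `warning` threshold, and the endpoint name are hypothetical and chosen only for this example; real monitoring tools define their own models and thresholds.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A hypothetical SLO definition for one endpoint."""
    name: str
    target: float   # minimum acceptable SLI, e.g. 0.999 (99.9% of requests succeed)
    warning: float  # SLI values below this (but at or above target) are shown as yellow


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI as the ratio of successful requests to all requests."""
    if total_requests == 0:
        # No traffic means nothing failed; treating this as 1.0 is an assumption.
        return 1.0
    return good_requests / total_requests


def slo_status(sli: float, slo: Slo) -> str:
    """Map a measured SLI to a traffic-light status."""
    if sli < slo.target:
        return "red"      # objective violated
    if sli < slo.warning:
        return "yellow"   # objective met, but with little margin
    return "green"


if __name__ == "__main__":
    checkout = Slo(name="POST /checkout", target=0.999, warning=0.9995)
    sli = availability_sli(good_requests=9_992, total_requests=10_000)
    print(f"{checkout.name}: SLI={sli:.4f} status={slo_status(sli, checkout)}")
```

The yellow band is a design choice: it gives a team an early signal that the margin is shrinking before the objective is actually missed.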
One of the key assumptions of SRE is “zero-toil”. In practice, it means that we, the engineers, want to eliminate the manual aspect of the work, which can be repetitive, boring, and prone to human error. Everything that can be automated should be. In a complex system with a huge number of components that can automatically scale up, it would not be possible to properly implement the necessary mechanisms by hand.
How is SRE implemented at Kitopi?
Even though our SRE team consists of only a few engineers, we make sure that the principles of SRE are applied consistently across all components of the system. At the center of our activities lies giving ownership and power to the relevant teams. In our case, each development team is responsible for a part of Kitopi’s architecture, so it is critical that every part is properly monitored and that any potential degradation is quickly detected and alerted on.
We use Dynatrace as a tool for most of our activities. We divided the system’s components into so-called Management Zones, each zone belonging to the relevant team. This way, the components owned by a team are monitored separately, and alerts go to the proper Slack channel. We rely heavily on automatically detected anomalies (such as an increased failure rate or slower response times), but for tracking applications' health we also use SLOs for the most important endpoints of the system. We had a lot of problems with false alarms triggered by oversensitive anomaly detection settings, so we worked closely with the development teams to make sure only meaningful problems were picked up and alerted on. The time needed to react to a degradation went down significantly thanks to building a culture of reliability ownership in the teams.
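The idea of routing alerts by ownership while filtering out noise can be sketched in a few lines. This is not Dynatrace configuration or its API: the zone names, channel names, and the minimum-duration rule below are all hypothetical and stand in for whatever settings a monitoring tool actually exposes.

```python
from typing import Optional

# Hypothetical mapping of management zones to the owning team's Slack channel.
ZONE_CHANNELS = {
    "orders": "#team-orders-alerts",
    "payments": "#team-payments-alerts",
}

# Hypothetical catch-all channel for components without a known owner.
FALLBACK_CHANNEL = "#sre-alerts"

# Assumed noise filter: ignore problems shorter than this many seconds.
MIN_DURATION_SECONDS = 120


def route_problem(zone: str, duration_seconds: int) -> Optional[str]:
    """Return the channel that should be alerted, or None if the problem is ignored."""
    if duration_seconds < MIN_DURATION_SECONDS:
        # Short-lived blips are treated as noise and not alerted on.
        return None
    return ZONE_CHANNELS.get(zone, FALLBACK_CHANNEL)


if __name__ == "__main__":
    print(route_problem("orders", 300))      # long enough, owned zone
    print(route_problem("orders", 30))       # too short, suppressed
    print(route_problem("inventory", 300))   # no owner in the mapping, goes to fallback
```

Keeping the mapping in one place makes ownership explicit: adding a team means adding one entry, and anything unmapped still reaches someone through the fallback channel.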
Recently, we’ve been using a new feature of Dynatrace, the Grail Engine, which allows us to use events and logs as sources of analytical information. In other words, we can get meaningful observability input by running computations on logs and events collected upon certain user actions in the system. It gives us a huge advantage in terms of observing trends in quickly fluctuating data.
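To show what "running computations on logs and events" can mean in principle, the sketch below counts how often one user action occurs per minute. It is plain Python over made-up sample data, not a Dynatrace query; the event names, timestamps, and function name are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

# Made-up sample events: (ISO timestamp, user action name).
SAMPLE_EVENTS = [
    ("2023-07-31T10:00:05", "order_placed"),
    ("2023-07-31T10:00:40", "order_placed"),
    ("2023-07-31T10:00:55", "order_cancelled"),
    ("2023-07-31T10:01:10", "order_placed"),
]


def events_per_minute(records, action):
    """Count occurrences of one action, grouped into one-minute buckets."""
    counts = Counter()
    for timestamp, name in records:
        if name != action:
            continue
        # Truncate the timestamp to the start of its minute to form the bucket key.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        counts[minute] += 1
    # Return the buckets in chronological order.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for minute, count in events_per_minute(SAMPLE_EVENTS, "order_placed").items():
        print(f"{minute:%H:%M} order_placed x{count}")
```

Bucketing by a short time window is what makes trends visible in rapidly changing data: a sudden drop or spike in the per-minute count stands out even when individual log lines look unremarkable.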
Summary
Site Reliability Engineering is still a young, dynamically growing discipline of the IT industry. It focuses mainly on improving the reliability of systems through better observability, solid alerting, and tools such as SLIs and SLOs. An important part of a Site Reliability Engineer’s job is automating tasks and learning from past outages. They cooperate closely with developers and DevOps engineers.
At Kitopi we’re proud to have grown a mature culture of SRE in teams that take ownership of the reliability of their applications. And we’re not stopping there!


