Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more. read.srepath.com
Tue, January 28, 2025
Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions. It's been a hot minute since the last episode of the Reliability Enablers podcast. Sebastian and I have been working on a few things in our own realms. On the personal and work front, I’ve been to over 25 cities in the last 3 months and need a breather. Meanwhile, have a listen to this interesting vendor, Ruchir Jha from Cardinal, who is working on the cutting edge of o11y to stop costs from spiraling out of control. (To the skeptics: he did not pay me for this episode.) Here’s an AI-generated summary of what you can expect in our conversation: In this conversation, we explore cutting-edge approaches to FinOps (i.e., cost optimization) for observability. You'll hear about three pressing topics: * Managing Tool Sprawl: Insights into the common challenge of juggling 5-15 tools and how to identify which ones deliver real value. * Reducing Observability Costs: Techniques to track and trim waste, including how to uncover cost hotspots like overused or redundant metrics. * AI for Observability Decisions: Practical ways AI can simplify complex data, empowering non-technical stakeholders to make informed decisions. We also touch on the balance between open-source solutions like OpenTelemetry and commercial observability tools. Learn how these strategies, informed by Ruchir's experience at Netflix, can help streamline observability operations and cut costs without sacrificing reliability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, November 12, 2024
Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he’s vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short. * Career Journey and Current Role : Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered. * Specialization in Mobile Observability : At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows. * Gap in Mobile Observability : Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed. * Motivation for User-Centric Tools : Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users. * Mobile's Role as a Brand Touchpoint : He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance. * Challenges in Measuring Mobile Reliability : Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability. * Observability Over-Focused on Backend Systems : Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. 
Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions. * Lack of Mobile Engineering Leadership in Observability : Reflecting on a former Meta product manager’s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams. * OpenTelemetry’s Limitations in Mobile : While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where
Tue, November 05, 2024
Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling tough challenges that engineers face every day. Here’s a slightly deeper dive into the concepts we discussed: * Career and Evolution in Tech: Andrew shares his journey through various roles, from early SRE at YouTube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive infrastructure experience across three distinct eras of the internet. He emphasizes the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it. * Building Prodvana and the Future of SRE: As CEO of the startup Prodvana, Andrew is focused on an "intelligent delivery system" designed to simplify production management for engineers and address cognitive overload. He highlights SRE as a field facing new demands due to AI, discusses insights shared with Niall Murphy and Corey Bertram around AI's potential in the space, distinguishes it from "web three" hype, and affirms that while AI will transform SRE, it will not eliminate it. * Challenges of Migration and Integration: Reflecting on his experience at YouTube after its acquisition by Google, Andrew discusses the challenges of migrating YouTube’s infrastructure onto Google’s proprietary, non-thread-safe systems. This required extensive adaptation and “glue code,” and offers insight into the intricacies and sometimes rigid culture of Google’s engineering approach at that time.
* SRE’s Shift Toward Reliability as a Core Feature : The speaker describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. They emphasize that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices. * Organizational Culture and Leadership Influence : Leadership’s role in SRE success is highlighted as crucial, with examples from Dropbox and Google emphasizing that strong, supportive leadership can shape positive, reliability-centered cultures. The speaker advises engineers to gauge leadership attitudes towards SRE during job interviews to find environments where reliability is valued over mere incident response. * Outcome-Focused Work Over Titles : Emphasis on assembling the right team based on skills
Tue, October 22, 2024
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, October 01, 2024
Here’s what we covered: Defining Platform Engineering * Platform engineering : Building compelling internal products to help teams reuse capabilities with less coordination. * Cloud computing connection : Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas. Ankit’s career journey * Didn't choose platform engineering; it found him. * Early start in programming (since age 11). * Transitioned from a product engineer mindset to building internal tools and platforms. * Key experience across startups, the public sector, unicorn companies, and private cloud projects. Singapore Public Sector Experience * Public sector : Highly advanced digital services (e.g., identity services for tax, housing). * Exciting environment : Software development in Singapore’s public sector is fast-paced and digitally progressive. Platform Engineering Turf Wars * Turf wars : Debate among DevOps, SRE, and platform engineering. * DevOps : Collaboration between dev and ops to think systemically. * SRE : Operations done the software engineering way. * Platform engineering : Delivering operational services as internal, self-service products. Dysfunctional Team Interactions * Issue : Requiring tickets to get work done creates bottlenecks. * Ideal state : Teams should be able to work autonomously without raising tickets. * Spectrum of dysfunction : From one ticket for one service to multiple tickets across teams leading to delays and misconfigurations. Quadrant Model (Autonomy vs. Cognitive Load) * Challenge : Balancing user autonomy with managing cognitive load. * Goal : Enable product teams with autonomy while managing cognitive load. * Solution : Platforms should abstract unnecessary complexity while still giving teams the autonomy to operate independently. How it pans out * Low autonomy, low cognitive load : Dependent on platform teams but a simple process. 
* Low autonomy, high cognitive load : Requires interacting with multiple teams and understanding technical details (worst case). * High autonomy, high cognitive load : Teams have full access (e.g., AWS accounts) but face infrastructure burden and fragmentation. * High autonomy, low cognitive load : Ideal situation, where teams get what they need quickly without detailed knowledge. Shift from Product Thinking to Cognitive Load * Cognitive load focus : More important than just product thinking; consider the human experience when using the system. * Team Topologies
Tue, September 24, 2024
Why many copy Google’s monitoring team setup * Google’s Influence. Google played a key role in defining the concept of software reliability. * Success in Reliability. Few can dispute Google’s ability to ensure high levels of reliability and to share useful ways to improve it in other settings. BUT there’s a problem: * It’s not always replicable. While Google's practices are admired, they may not be a perfect fit for every team. What is Google’s monitoring approach within teams? Here’s what Google does: * Google assigns one or two people per team to manage monitoring. * Even with centralized infrastructure, a dedicated person handles monitoring. * Many organizations use a separate observability team, unlike Google's integrated approach. If your org is large enough and prioritizes reliability highly enough, you might find it feasible to follow Google’s model to a tee. Otherwise, a centralized team with occasional “embedded x engineer” secondments might be more effective. Can your team mimic Google’s model? Here are a few things you should factor in: Size matters. Google's model works because of its scale and technical complexity. Many organizations don’t have the size, resources, or technology to replicate this. What are the options for your team? Dedicated monitoring team (very popular but $$$): If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget, so it’s not something that a startup or SME can easily justify. Dedicate SREs to monitoring work (effective but difficult to manage): You might do this on rotation or make an SRE permanently “responsible for all monitoring matters”. Putting SREs on permanent tasks might lead to burnout as it might not suit their goals, and rotation work requires effective planning.
Internal monitoring experts (useful but a hard capability to build): One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team’s needs. This should be how we get monitoring work done, but it’s hard to get volunteers across a majority of teams. Transitioning monitoring from project work to maintenance: 2 distinct phases. Initial setup (the “project”): SREs may help set up the monitoring/observability infrastructure. Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively. Post-project phase (“keep the lights on”): Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that? Who will maintain the monitoring system? Answer: usually not the same team. After the project phase, a new set of peop
Tue, September 17, 2024
Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep. Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise. When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard. Sebastian proposes a fundamental fix for this data overload: be deliberate with the data you emit. When instrumenting your systems, be intentional about what data you collect and transport. Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem. To combat this, focus on: * Being Deliberate with Data . Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals. * Filtering Data Effectively . Improve how you filter incoming data to eliminate less relevant information and retain what's crucial. * Refining Alerts . Optimize alert rules such as creating tiered alerts to distinguish between critical issues and minor warnings. Dan Ravenstone, who leads platform at Top Hat, discussed “triaging alerts” recently. He shared that managing millions of alerts, often filled with noise, is a significant issue. His advice: scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don’t impact the user journey . 
According to Dan, the anatomy of a good alert includes: * A run book * A defined priority level * A corresponding dashboard * Consistent labels and tags * Clear escalation paths and ownership To elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes. The learning point is simple: aim for quality over quantity. By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents.
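Dan's checklist for a good alert lends itself to an automated lint over alert definitions. What follows is only a minimal sketch in Python, not something described in the episode: the `Alert` record, its field names (`runbook_url`, `priority`, `dashboard_url`, `labels`, `escalation_owner`), the `REQUIRED_LABELS` set, and the example URLs are all hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    runbook_url: str = ""
    priority: str = ""            # e.g. "P1", "P2" (an assumed scheme)
    dashboard_url: str = ""
    labels: dict = field(default_factory=dict)
    escalation_owner: str = ""


# Labels every alert is expected to carry (an assumption for this sketch).
REQUIRED_LABELS = {"service", "team"}


def missing_parts(alert: Alert) -> list:
    """Return the checklist items that this alert definition lacks."""
    problems = []
    if not alert.runbook_url:
        problems.append("runbook")
    if not alert.priority:
        problems.append("priority")
    if not alert.dashboard_url:
        problems.append("dashboard")
    if not REQUIRED_LABELS.issubset(alert.labels):
        problems.append("labels")
    if not alert.escalation_owner:
        problems.append("escalation owner")
    return problems


good = Alert(
    name="checkout-error-rate",
    runbook_url="https://example.com/runbooks/checkout",
    priority="P1",
    dashboard_url="https://example.com/dashboards/checkout",
    labels={"service": "checkout", "team": "payments"},
    escalation_owner="payments-oncall",
)
noisy = Alert(name="high-cpu", labels={"service": "batch"})

print(missing_parts(good))   # nothing missing
print(missing_parts(noisy))  # lists every gap in the definition
```

A check like this could run in CI against alert definitions, so that an alert without a runbook or an owner is caught before it ever pages anyone.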
Tue, September 10, 2024
The question then condenses down to: Can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years (including a few at the coveted consultancy Thoughtworks) and now coaches others. She and I discussed the link between this role and software reliability.
Wed, September 04, 2024
We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas. Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations. Nathen Harvey is no stranger to this problem. He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018. His focus has been on questions like: How do we help teams get better at delivering and operating software? You and I can agree that this is an important question to ask. I’d listen to what he has to say about DORA because he’s got a wealth of experience behind him, having also run community engineering at Chef Software. Before we continue, let’s explore What is DORA? in Nathen’s (paraphrased) words: DORA is a software research program that's been running since 2015. This research program looks to figure out: How do teams get good at delivering, operating, building, and running software? The researchers were able to draw out the concept of the metrics based on correlating teams that have good technology practices with highly robust software delivery outcomes . They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction. Essentially, all those things that matter to the business. One of the challenges the researchers found over the last decade was working out: how do you measure something like software delivery? It's not the same as a factory system where you can go and count the widgets that we're delivering necessarily. The unfortunate problem is that the factory mindset I think still leaks in. I’ve personally noted some silly metrics over the years like lines of code. 
Imagine being asked constantly: “How many lines of code did you write this week?” You might not have to imagine. It might be a reality for you. DORA’s researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. They settled on and validated 4 key measures for software delivery performance. Nathen elaborated that 2 of these measures look at throughput: [Those] two [that] look at throughput really ask two questions: * How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, a change to go from the developer's workstation right through to production? And then the second question on throughput is: * How frequently are you updating production? In plain English, these 2 metrics are: * Deployment Frequency . How often code
Tue, August 27, 2024
We’ll explore 3 use cases for monitoring data. They are: * Analyzing long-term trends * Comparing over time or experiment groups * Conducting ad hoc retrospective analysis Analyzing long-term trends You can ask yourself a couple of simple questions as a starting point: * How big is my database? * How fast is the database growing? * How quickly is my user count growing? As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like: * How is the database performance evolving? Are there signs of degradation? * Is there consistent growth in data volume that may require future infrastructure adjustments? * How is overall resource utilization trending over time across different services? * How is the cost of cloud resources evolving, and what does that mean for budget forecasting? * Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them? Sebastian mentioned that it's a part of observability he enjoys doing. I can understand why. It’s exciting to see how components are changing over a period and working out solutions before you end up in an incident response nightmare. Getting to effectively analyze the trends requires the right level of data retention settings. Because if you're throwing out your logs, traces, and metrics too early, you will not have enough historical data to do this kind of work. Doing this right means having the right amount of data in place to be able to analyze those trends over time, and that will of course depend on your desired period. Comparing over time or experiment groups Google’s definition You're comparing the data results for different groups that you want to compare and contrast. Using a few examples from the SRE (2016) book: * Are your queries faster in this version of this database or this version of that database? 
* How much better is my memcache hit rate with an extra node? * Is my site slower than it was last week? You're comparing across different buckets of time and different types of products. A proper use case for comparing groups: Sebastian worked through this use case recently because he had to compare two different technologies for deploying code: AWS Lambda vs AWS Fargate ECS. He took those two services and tried different memory sizes and virtual CPU counts. Then he ran different volumes of requests against those settings to figure out which technology was the more cost-effective option. His need for this went beyond engineering work; it was about enabling product teams with the right decision-making data. He wrote a knowledge base article to guide them toward a more educated decision on the right AWS service. Having the data to compare
Tue, August 20, 2024
Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He’s dedicated much of his speaking time at DevOps events to a topic that gets less coverage at such technical events. A lot of what he said alluded to ways to become a more valuable engineer. I’ve broken them down into the following areas: * Avoid the heroic efforts * Mind + heart > Mind alone * Curiosity > Credentials * Experience > Certifications * Thinking for complexity When I saw him in Toronto, I thought he would talk about pre-production observability. That would have made sense after watching the previous presenter do a deep dive into Kubernetes tooling. But surprisingly, he started with culture and the need to prevent burnout among engineers, a topic that is as important today as it was 2 years ago when he gave the talk. Here’s a look into Shlomo’s philosophy and the practices he champions. Avoid the heroic efforts Shlomo's perspective on heroics in engineering and operations challenges a traditional mindset that often glorifies excessive individual efforts at the cost of long-term sustainability. He emphasizes that relying on heroics, where individuals consistently go above and beyond to save the day, creates an unhealthy work environment. "We shouldn't be rewarding people for pulling all-nighters to save a project; we should be asking why those all-nighters were necessary in the first place." Relying on heroics not only burns out engineers but also masks underlying systemic issues that need to be addressed. So, instead of celebrating these heroic efforts, Shlomo advocates for creating processes and metrics that ensure smooth operations without the need for constant intervention. Mind + Heart > Mind alone One of the challenges Shlomo has faced recently is scaling his engineering organization amidst rapid growth. His approach to hiring is unique; he doesn’t just look for technical skills but prioritizes self-awareness and kindness.
"Hiring with heart means looking for individuals who bring empathy and integrity to the team, not just expertise." When he joined The Score, a subsidiary of Penn Interactive, Shlomo immediately revamped the hiring practices by integrating the values above into the process. He favors role-playing scenarios over solely using behavioral interviews to evaluate candidates, as this method reveals how individuals might react in real production situations. I tend to agree with this approach as seeing how people are doing the work is more enlightening than asking them how they behaved in a past situation alone. Curiosity > credentials How it plays into career progression When it comes to career progression, Shlomo places little value on traditional markers like education or years of experience. Instead, he valu
Thu, August 15, 2024
Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because processes supporting the incident response are not robust. Incident response software alone isn't going to fix bad incident processes. It's gonna help for sure. You need these incident management tools to manage the data and communications within the incident. But you also need to have effective processes and human-technology integration. Dr Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority setting. According to Vladislav, at the beginning of your SRE journey, it’s not going to be focused on incident response in terms of setting up an incident response process, but more on core SRE artifacts like SLIs, availability measurement, SLOs, etc. And now we are safely investing more into the customer-facing features and things like this. So this is going to be the core SRE concepts. But then at some point, once you've got these things, more or less established in the organization. Understanding and Leveraging SLOs Once your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they’ve been validated through production. Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively. Implementing a Formal Incident Response Before you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place. Without this, the process might not be as effective. When the foundational SLOs and organizational culture are strong, a well-structured incident response process can significantly enhance its effectiveness. 
Coordinating During Major Incidents When a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams. Consider appointing incident commanders and coordinators, as recommended in PagerDuty’s documentation , to manage this coordination. Develop a lightweight process to guide how incidents are handled. Classifying Incidents Establish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three. Due to the inherently fuzzy nature of incidents, your classification sys
Tue, August 13, 2024
According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale. However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function. Dr. Vladislav Ukis is well qualified to talk about reliability, being at Siemens Healthineers and leading 250 people globally to offer their cloud platform running off Microsoft Azure. We discussed key concepts from his book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations . Unlike other technical books in this field, Dr Ukis’ book is aimed at technology professionals who are beginners to the reliability journey . This is different from the Site Reliability Engineering (2016) book by Google, which covers all the bells and whistles that SRE encompasses. That book requires a degree of prior knowledge and also prior experience in the field. Vlad wanted to make it more accessible: What I did with my book is to say, ‘Okay, so now you've never done operations, but you now are thrown in the world of online services where you have to operate them. How do you get started?’ So this is what the book is for. So for people who want to learn how to get started in the world of operating online services. ITIL was originally developed by the UK government in the 80s to improve IT governance. It is best related to SRE through its service management and incident management components. But it’s for managing systems that are more predictable and can be handled through strict process control. Modern product delivery doesn’t have the luxury of bureaucratic levels of predictability that older IT services have. 
It requires a more engineer-oriented approach to solving problems/incidents and providing services. So how was Vlad’s experience bringing SRE into an organization that previously had run solely on the ITIL model? Siemens Healthineers for many years operated like a traditional software development organization. In other words, they were developing on-prem software, not cloud software. The company would ship the physical software product to its hospital customers and then those hospitals would have the software operated and supported by their IT departments. The change came about when Siemens Healthineers began to work on a new digital health platform, which would be cloud-based from the beginning. So they would no longer ship physical software in discs to customers, but provide online services in the cloud c
Tue, August 06, 2024
Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is! Our systems are becoming more complex and so are the resulting incidents . Learning about complexity can help reliability folk go into an incident with less anxiety, which we’ll explore in this episode. We'll explore the causes of complexity in incidents and how the Cynefin framework classifies incidents. We'll also deep dive into the concept of complexity itself and dispel a common issue where it gets mixed up with complicatedness. About Sonja Sonja is a co-founder of Complexity Fit and founder of More Beyond focusing on helping teams build capacity for sensemaking, collaboration, and wayfinding. She has a background in programming from her early career as a meteorologist, having worked in C and Fortran, and then progressing to working as a web developer. You can connect with Sonja to learn more about complexity via LinkedIn .
Tue, July 30, 2024
Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down into white-box versus black-box monitoring. That is not the same as internal versus external monitoring, a distinction we'll explore further. We'll cover topics like: - (quickly) What is monitoring? - What is white-box monitoring? - What is black-box monitoring? - The rising importance of black-box monitoring This is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) book. The chapter was written by Rob Ewaschuk and edited by Betsy Beyer.
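One way to picture the two styles is a toy model. The sketch below loosely follows the SRE book's wording (white-box monitoring reads metrics exposed by the system's internals; black-box monitoring tests externally visible behaviour as a user would see it). `ToyService`, its counters, the `/checkout` path, and both helper functions are hypothetical stand-ins, not a real monitoring API.

```python
class ToyService:
    """Stand-in for a real service; everything here is hypothetical."""

    def __init__(self):
        self.requests_total = 0       # internal counter (white-box signal)
        self.errors_total = 0         # internal counter (white-box signal)
        self.checkout_broken = False

    def handle(self, path: str) -> int:
        """Public interface: return an HTTP-style status code."""
        self.requests_total += 1
        if self.checkout_broken and path == "/checkout":
            self.errors_total += 1
            return 500
        return 200


def whitebox_error_ratio(service: ToyService) -> float:
    """White-box: read metrics exposed by the system's internals."""
    if service.requests_total == 0:
        return 0.0
    return service.errors_total / service.requests_total


def blackbox_probe(service: ToyService, path: str) -> bool:
    """Black-box: exercise externally visible behaviour, as a user would."""
    return service.handle(path) == 200


service = ToyService()
service.checkout_broken = True
for _ in range(99):
    service.handle("/home")           # plenty of healthy traffic

checkout_ok = blackbox_probe(service, "/checkout")
error_ratio = whitebox_error_ratio(service)
print(checkout_ok, error_ratio)
```

In this toy run the aggregate white-box error ratio stays small while the black-box probe of the checkout path fails outright, which is one reason the two views tend to be used together.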
Tue, July 09, 2024
Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data. We crammed into just under 25 minutes ideas like these 7 takeaways: * Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management. * Prioritize Customer Health: In Jack’s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers for a more comprehensive view of your system's impact. * Apply Mathematical Techniques: Incorporate advanced mathematical concepts, like the Nyquist-Shannon sampling theorem and the t-digest algorithm, to enhance data accuracy and observability metrics. * Build Accurate Percentiles: Implement techniques to accurately reproduce percentiles from raw data to ensure reliable performance metrics. * Manage High Cardinality Data: Develop strategies to handle high cardinality data without overwhelming your resources, ensuring you extract valuable insights. * Standardize Log Records: Using readily available frameworks to emit standardized log records makes data easier to process and visualize. * Handle High-Velocity Data Efficiently: Develop methods for collecting and processing high-velocity data without incurring prohibitive costs. Watch Jack’s Monitorama talk via this link: https://vimeo.com/843996971
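The takeaway about accurate percentiles can be illustrated with a small calculation. This is a sketch under stated assumptions, not Jack's method: the latency samples are made up, the nearest-rank definition of a percentile is one choice among several, and the t-digest algorithm mentioned above is not implemented here. It only shows why averaging pre-computed per-host percentiles can diverge from the percentile of the pooled raw data.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it (p is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds from two hosts.
# host_b is slow but serves far less traffic than host_a.
host_a = [10] * 990 + [50] * 10      # 1000 requests
host_b = [200] * 100                 # 100 requests

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)

# Tempting but misleading: average the per-host p99 values.
averaged = (p99_a + p99_b) / 2

# Reference answer: compute p99 over the pooled raw samples.
pooled = percentile(host_a + host_b, 99)

print("per-host p99:", p99_a, p99_b)
print("average of p99s:", averaged)
print("p99 of pooled data:", pooled)
```

With these made-up numbers the averaged figure understates what the slowest 1% of requests experience. Mergeable sketches such as t-digest exist so that something close to the pooled answer can be recovered without shipping every raw sample.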
Tue, July 02, 2024
Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone who gave a talk at Monitorama about this very topic. He also happens to be an avid skateboarder! Here are 9 takeaways from our conversation: * Regularly Review and Update Monitoring Systems : Don’t set up monitoring once and forget about it. Continuously assess and update your monitoring systems to ensure they remain relevant and effective. * Focus on Relevant Alerts : Ensure your alerting system is tailored to indicate real problems. Avoid relying on outdated criteria such as high CPU or memory usage unless they directly impact user experience. * Adopt a User-Centric Approach : Develop alerts based on how issues affect the user experience rather than purely technical metrics. This helps prioritize what truly matters to the end user. * Evaluate Alert Value : Critically assess each alert for its value. Ask whether the alert provides actionable information and if it impacts the user or business. Eliminate or adjust alerts that don’t meet these criteria. * Reduce Alert Noise : Strive to minimize unnecessary alerts that contribute to noise and obscure real issues. This makes it easier to detect and respond to genuine problems. * Understand the User Journey : Document the user journey and create Service Level Objectives (SLOs) to align alerts with user-impacting events. This ensures alerts are meaningful and actionable. * Secure Leadership Support : Gain buy-in from leadership by demonstrating the long-term benefits of an effective alerting system. Emphasize how it can improve user satisfaction and operational efficiency. * Improve Documentation and Preparedness : Ensure thorough documentation for all systems and alerts. This reduces stress and increases efficiency, particularly for engineers handling on-call duties. * Automate Alert Responses : Implement automation to handle routine alerts. 
This reduces the manual burden on engineers and allows them to focus on more complex issues. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, June 25, 2024
Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil. We hit the jackpot with concepts like: * what is toil according to a 5-point set of criteria * why even care about toil? * where you can find toil in your software system * Google’s goal for how much work (%) should be toil * the fact that toil isn’t always all that bad Don’t have time to listen to what we learned or added to the concepts? Check out the takeaways toward the end of this email. But first… Before we jump into the takeaways, here’s a new segment I’m trying out for newsletters. I’ll highlight a new reliability tool that I think could help you. Do you struggle to visualize your Kubernetes workloads? In that case, have you heard of kube-ops-view? It helps you visualize your complex K8s clusters and everything inside them. For a deeper rundown, visit the LinkedIn post I made about kube-ops-view which shares a few more details. Back to our original programming… Here are key takeaways from our chat: * Define and Identify Toil : Regularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction. * Prioritize Automation : Look for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency. * Embrace the Role of an SRE : Realize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues. * Address Common Sources of Toil : Identify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas. * Adopt a Toil Elimination Mindset : Cultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes. 
* Develop a Culture of Continuous Improvement : Encourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams. Until next time, happy toil hunting! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, June 18, 2024
The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture. We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives. But how often do we explore the nuances of how we are learning? Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (software consultancy) and now does the same work under her own banner. Her work ties in well with the ideas shared by Manuel Pais in episode #45 about how enabling teams can support a continuous learning culture. We tackled issues like the value of certifications, comparing technical with non-technical skills, and more. You can connect with Sorrel via LinkedIn. Learn more about what Sorrel does via LaaS.consulting. Here’s a bonus section because you read all this way. It covers 5 public outages and how the affected teams could improve their learning culture: 1. Slack Outage (February 2023) Slack experienced a global outage disrupting communication for hours due to backend infrastructure issues. Perhaps the team could focus their learning on more robust infrastructure management and resilience improvement. 2. Twitter Algorithm Glitch (April 2023) A glitch in Twitter's algorithm caused timeline issues, stemming from a problematic software update. Perhaps the team could focus their learning on thorough testing and game days to rectify critical system errors swiftly. 3. Microsoft Azure AD Outage (March 2023) Azure Active Directory faced a significant outage due to an internal configuration change. Perhaps the team could focus their learning on the importance of rigorous change management and how to address misconfigurations quickly. 4. 
Google Cloud Platform Networking Issue (May 2023) Google Cloud Platform experienced widespread service disruptions from a software bug in its networking infrastructure. Perhaps the team could focus their learning on the need for comprehensive testing and preventing disruptions. 5. GitHub Outage (June 2023) GitHub suffered a major outage caused by a cascading failure in its storage infrastructure. Perhaps the team could focus their learning on robust fault-tolerance mechanisms and ways to address the root causes of failures. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit <a hre
Tue, June 11, 2024
I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams. In this second part, we will talk about platform teams. A quick refresher on what platform teams do In the team topologies context: Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity. They achieve this directive by abstracting away common infrastructure and operational concerns. By doing this, they aim to allow stream-aligned teams to focus on delivering business value. Here are the key takeaways from our conversation For those who don’t have time to listen to this episode (but you’re missing out on a great conversation): * Focus on User-Centric Design : Prioritize the user experience in platform development. Regularly collaborate with internal teams to ensure the platform meets their needs and reduces their pain points. * Build and Maintain Trust : Establish and nurture trust with your platform’s users. Trust is crucial for platform adoption and can prevent resistance, thus ensuring sustained use. * Justify Platform Value : Continuously demonstrate the value of your platform to management and stakeholders, especially during economic downturns. Highlight its contributions to avoid cuts and maintain support. * Understand Adoption Lifecycle : Recognize that platforms go through different stages of adoption. Identify and support early adopters, and gradually bring in late adopters by showcasing successful use cases. * Enhance Collaboration : Foster open communication between platform teams and other teams. Avoid rigid roadmaps and be adaptable to changing needs to prevent barriers and build stronger internal relationships. * Manage Cognitive Load : Be mindful of the cognitive load on your teams. Simplify processes and reduce unnecessary complexities to enhance productivity and efficiency. 
* Use Tools to Measure Cognitive Load : Implement tools like Teamperature to assess the cognitive load on your teams regularly. Use the insights to identify and mitigate factors contributing to cognitive overload. * Leverage Experienced Product Managers : Ensure experienced product managers are part of your platform team. They can balance long-term goals with the flexibility needed to adapt to the evolving needs of internal users. I think the uncommon takeaway here is #9 in that platform teams should treat their platform as a product. Product Managers like Paweł Huryn and <a target="_blank" href="https://www.svpg.com/articles/"
Tue, June 04, 2024
I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, who explains in a 2-part series 2 of the most relevant team topologies for reliability work. In this first part, we will talk about enabling teams. A quick refresher on what enabling teams do In the team topologies context: Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas. This kind of team is available to provide expertise, guidance, and support to other teams working to adopt new technologies, practices, or skills. In other news… This podcast has a new name What more fitting moment to announce renaming the SREpath podcast to “The Reliability Enablers” podcast? This name change reinforces our quest to demystify and enable reliability efforts so that more organizations successfully implement SRE principles and beyond. Before we get to the 8 takeaways Here’s something relevant to enabling reliability work: a reliability workflows map I’ve had in my private notes for years, now going public. What is a workstream? 🤔 You might have heard of “value streams”. They show the end-to-end journey of creating and delivering value to a customer. Workstreams support your value streams. They cover the activities carried out to do so. In summary: Value streams are the goals and workstreams are the activities you do to achieve those goals. Okay, now time for the erudite takeaways that Manuel gave me from our talk. Takeaways from the episode Here are the key takeaways from our conversation for those who don’t have time to listen (but you’re missing out on a great audio conversation): * Create Enabling Teams : Form SRE-focused enabling teams to facilitate technical training, optimize cloud architecture, improve documentation, and overall help other teams build their capabilities. 
* Work to Minimize Cognitive Load : Minimize the cognitive load on engineers by centralizing complex and repetitive tasks, allowing engineers to concentrate on innovation and high-value work. You can measure cognitive load and manage it through the Teamperature tool. * Facilitate Learning and Adoption of Best Practices : Use SRE enabling teams to educate product teams on critical practices like error budgets and service level objectives, making the learning process gradual and manageable. * Collaborate among Topologies for Effective Tooling : Enabling teams should work with platform teams to inform their plans to develop and co-evolve tools and services that support reliability and observability practices, like automated dashboards and alerting systems. * Adapt Appro
Thu, May 30, 2024
Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs and how to make them relevant to stakeholders. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, May 28, 2024
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs. Here are 5 takeaways from the show: * Start Small with SLOs : Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once. * Defend and Enforce SLOs : Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability. * Continuous Improvement : Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success. * Effective Communication Skills : Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate. * Understanding User Needs : Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, May 21, 2024
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show: * Involve Technical Stakeholders Early : Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service. * Differentiate Between SLAs and SLOs : Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives. * Prioritize Meaningful Metrics : Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance. * Align with Customer Expectations : Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences. * Avoid Alert Fatigue : Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most. * Start Top-Down with SLIs : Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs. * Prepare for Deep Dives : Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work. This is a public episode. 
If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
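As a rough illustration of how an SLI feeds an SLO, the sketch below computes a request-based availability SLI and checks it against a target. The `Slo` class, the "checkout availability" name, the 99.9% target, and the request counts are hypothetical values picked for the example, not figures from the episode or the book.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A hypothetical service level objective on a ratio-style SLI."""
    name: str
    target: float  # 0.999 means 99.9% of events should be good


def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed (a simplifying assumption)
    return good_events / total_events


def meets_slo(slo: Slo, good_events: int, total_events: int) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo.target


checkout = Slo(name="checkout availability", target=0.999)

# Two made-up measurement windows of 100,000 requests each.
healthy_week = meets_slo(checkout, good_events=99_950, total_events=100_000)
rough_week = meets_slo(checkout, good_events=99_800, total_events=100_000)

print(f"{checkout.name}: healthy week meets SLO? {healthy_week}")
print(f"{checkout.name}: rough week meets SLO? {rough_week}")
```

Keeping the SLI a plain good-over-total ratio is one way to stay close to what users experience; a real setup would choose which events count as "good" from customer expectations, as the takeaways suggest.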
Tue, May 14, 2024
No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed the concrete and perceived value on balance sheets and in the minds of leaders. Sofia Fosdick shares practical insights on curbing high observability costs. She’s a senior account executive at Honeycomb.io and has held similar titles at Turbonomic, Dynatrace, and Grafana. Like always, this is not a sponsored episode! We tackled the cost issue by covering ideas like aligning cost with value, event-based systems, and dynamic sampling. You will not want to miss this conversation if your observability bill is starting to look dangerous. You can connect with Sofia via LinkedIn. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, May 07, 2024
Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, enabling many developer teams to take on better practices in observability. He’s a senior systems engineer at IKEA and is part of its observability enabling team. Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of switching them out of this. You can connect with Timothy via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, April 30, 2024
Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. His experiences with it reduced the number and severity of serious incidents and outages. He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are taken seriously. Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the challenges of getting developers to integrate chaos into the SDLC. You will not want to miss this conversation! You can connect with Ananth via LinkedIn. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, April 23, 2024
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and choosing not to do it well or at all. Here are key takeaways from our conversation: * Prioritize Risk Mitigation : Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively. * Consider Cost-Effectiveness : When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to your organization. * Advocate Continuously : Continuously advocate for the importance of reliability engineering within your organization. Communicate transparently about the value SRE teams add and the impact of their work on the organization's success. * Explore Alternative Metrics : Explore alternative availability metrics beyond traditional time-based measurements. Consider event-based metrics to gain a more nuanced understanding of service availability and performance. * Embrace Regional Focus : Shift from relying solely on global availability metrics to more granular regional metrics. Understand the varying impacts on different customer audiences and prioritize improvements accordingly. * Navigate Regulatory Challenges : Be mindful of regulatory challenges, such as GDPR, and understand their implications on service availability and reliability. Adapt strategies and solutions to comply with regulations while maintaining operational efficiency. * Align Reliability with Revenue : Recognize the direct correlation between service availability and revenue generation, particularly for revenue-driven services like ad platforms. Invest in reliability engineering to ensure consistent revenue streams. 
* Tier Services Strategically : Implement a tiered approach to prioritize reliability efforts, with revenue-generating services like ad platforms placed in the top tier. Allocate resources based on the criticality of services to the organization's objectives. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
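The "Explore Alternative Metrics" takeaway contrasts time-based and event-based availability. A minimal sketch of the two calculations follows; the 30-day window, the 60-minute outage, and the request counts are invented numbers used only to show how the two views can diverge.

```python
def time_based_availability(uptime_minutes: float, downtime_minutes: float) -> float:
    """Availability as uptime / (uptime + downtime)."""
    return uptime_minutes / (uptime_minutes + downtime_minutes)


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Availability as successful requests / total requests."""
    return successful_requests / total_requests


# Hypothetical 30-day window with a 60-minute outage during a quiet period.
window_minutes = 30 * 24 * 60          # 43,200 minutes
downtime = 60
time_view = time_based_availability(window_minutes - downtime, downtime)

# During that quiet hour only 2,000 of 10,000,000 requests failed.
total = 10_000_000
failed = 2_000
request_view = request_based_availability(total - failed, total)

print(f"time-based availability:    {time_view:.4%}")
print(f"request-based availability: {request_view:.4%}")
```

With these assumed numbers the outage looks worse on the clock than it does in terms of failed requests, because it landed in a low-traffic period. The same calculation can be run per region to get the more granular view the takeaways recommend.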
Tue, April 16, 2024
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective. We'll cover how it's very different to the typical IT risk management mindset. Here are key takeaways from our conversation: * Embrace Risk with Velocity : Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation. * Reevaluate Risk Management Approaches : Challenge traditional approaches to risk management, especially in larger organizations with extensive governance procedures. Explore alternative methods that prioritize agility and efficiency without compromising reliability. * Conceptualize Risk as a Continuum : View risk as a continuous spectrum and assess it based on various dimensions, such as the complexity of changes, the criticality of systems, and the impact on user experience. Continuously evaluate and adjust risk management strategies accordingly. * Balance Stability and Innovation : Recognize that extreme reliability comes at a cost and may hinder the pace of innovation. Aim for an optimal balance between stability and innovation, prioritizing user satisfaction and efficient service operations. * Implement Service-Level Objectives (SLOs) : Deliver services with explicitly delineated levels of service, allowing clients to make informed risk and cost trade-offs when building their systems. Define SLOs based on the importance and criticality of services to enable better decision-making. * Visualize Risk Assessment : Utilize visual representations, such as whiteboard diagrams, to assess and communicate different levels of risk within your software systems. Encourage collaborative discussions among team members to determine acceptable risk levels. 
* Prioritize Customer Impact : Consider the impact of changes on customer experience and prioritize risk management efforts accordingly. Differentiate between critical user journeys and cosmetic changes to allocate scrutiny appropriately. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, April 09, 2024
Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell us how to achieve greater maturity in this aspect of software operations. She's previously held SRE roles and currently works as Principal Engineer at Syntasso, the company behind the popular Kratix platform framework. Abby highlighted the need for concrete definitions and maturity models in platform engineering trends, cautioning against equating developer portals with fully functional platforms. We also dived into the need to understand your socio-technical landscape with an emphasis on the value of frameworks and method-based approaches. You can connect with Abby via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, April 02, 2024
The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected? Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data. He is the founder and CEO of SquaredUp, a dashboard software company based out of Maidenhead, UK with over 10 years of experience in the monitoring space. Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights. You can connect with Richard via LinkedIn This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, March 26, 2024
This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem, which is analogous to that other technology debate of build vs buy. Here are key takeaways from our conversation: * Adapt your storage solutions to business needs : Understand the diverse storage options available and tailor them to specific business needs, considering factors like data type, access patterns, and scalability requirements. * Optimize your load balancing : Implement global load balancing strategies to optimize user experience and performance by directing traffic to the nearest data center, which minimizes latency and maximizes resource utilization. * Don't hesitate to continuously evaluate your cloud : Assess the suitability of cloud solutions against your organization's needs, considering factors like cost, control, scalability, and security, and be open to reevaluating decisions based on evolving requirements. * Make strategic decisions for your operations footprint : Lean on decisions based on thorough analysis that considers: * Encourage objective evaluation and formal planning processes in decision-making : Avoid emotional reactions or being swayed by external influences, to ensure decisions are based on sound analysis and truly aligned with organizational goals. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, March 19, 2024
This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design outlined in the book. One thing is for sure. Building a data center for your own needs is HARD work with many considerations you must make. Here are key takeaways from our conversation: Importance of understanding data center fundamentals : Even if you're not operating at the scale of companies like Google, understanding the fundamentals behind data center infrastructure can help. This knowledge can inform decisions on cloud services, high availability strategies, and the architectural design of systems to ensure resilience and scalability. The impetus to leverage cloud infrastructure: The transition from traditional on-premises infrastructure to cloud-based solutions is a critical trend. Organizations can learn from how tech giants manage resources efficiently at scale, to improve their resource allocation. Cyclical trends in technology adoption : trends in technology are cyclical and that can inform strategic decisions. As there's a current discussion around moving from cloud-centric models back to more traditional data center approaches, understanding the history and evolution of tech infrastructure can prepare organizations to adapt to and anticipate future shifts in the technological landscape. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Thu, March 14, 2024
Will Platform Engineering replace DevOps or SRE or both? I don’t think this is the case at all. Neither does Ajay Chankramath. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. I’d take his word for it since he’s held senior leadership roles in release engineering and more since 2002. In this bonus episode of the SREpath podcast, Ajay shared his perspective on the debate about SRE vs DevOps vs Platform Engineering. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, March 12, 2024
FinOps is on the tip of many tongues in the software space right now, as we try to curb our cloud costs. Ajay Chankramath has given talks on FinOps at conferences like the DevOps Enterprise Summit (DOES) among others. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. His peers like Martin Fowler and Neal Ford have originated ideas like refactoring, microservices, and more. He shared practical advice for avoiding a harsh, restrictive cost control approach and instead taking a holistic financial view of your software operations. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Thu, March 07, 2024
Observability is going through interesting times. David Caudill believes that delusions are getting in the way of our success in this area. He's a senior engineering manager at Capital One, a US-based bank. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, February 27, 2024
Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like: * Monitoring is one of the primary means by which service owners keep track of a system's health and availability. * Efficient use of resources is important anytime a service cares about money. * Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention. * SRE has found that roughly 70 percent of outages are due to changes in a live system. Best practices in this domain use automation to accomplish tasks such as implementing progressive rollouts. * Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, February 20, 2024
Sebastian and I got together to react to and discuss 5 passages from Chapter 1 of Google's Site Reliability Engineering book (2016) by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like: * The sysadmin approach and the accompanying dev/ops split have a number of disadvantages and pitfalls. * Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products. * The term DevOps emerged in industry. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions. * Google caps operational work for SREs at 50 percent of their time. Their remaining time should be spent using their coding skills on project work. * Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Tue, February 13, 2024
Third and final instalment of the Growing as an SRE series, covering practical ideas for planning your career progression.
Thu, February 08, 2024
In part 1, we covered the first truth - that you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 more truths that are somewhat trickier... Background music credit: Luna by KaizanBlue
Tue, January 30, 2024
DORA metrics are a hot topic among technology executives in all kinds of enterprises. But there's more to engineering culture than solely relying on the numbers they give you. We have a rare treat for you because Ash got Tim Wheeler on the pod. He doesn't do much social media or many podcast episodes. Tim is Director of Engineering Excellence at SquaredUp, where he follows the DORA metrics but emphasizes starting conversations around them rather than setting directives.
Tue, January 23, 2024
How can you grow as an SRE? You've probably thought about your career progression at some point. Ash put together his initial thoughts on this topic. Listen on to learn how he unpacks the first idea of "You don't get promotions with tenure".
Tue, January 16, 2024
Jade Rubick needs no introduction in the reliability and observability space. He was VP of Engineering at New Relic from 2010 to 2019. It was my pleasure to take on his non-obvious ideas on managing expectations with teams, especially platform-based teams. We had a few spicy ideas to dive into. We also touched on topics like enhancing engineering practices, DORA metrics, and so much more. Be sure to listen all the way through to learn Jade's amazing insights.
Tue, January 09, 2024
I did not know that Google itself does consulting around its SRE practices. This is not a sponsored episode LOL! I wanted to talk with my SRE friend, Yury Niño Roa, about her drawings and SRE ideas, but we dove into a whole lot more than that. We spoke about her work at Google's PSO office, the antipatterns she's seen, and a whole lot more. Listen in for an engaging conversation. You can follow Yury and her amazing drawings via: https://www.linkedin.com/in/yurynino/
Tue, January 02, 2024
Sebastian is back for this episode to help set out direction for 2024. We reflected during the holidays on the problems SREs faced in 2023 in terms of job insecurity, burnout, and "that really shouldn't be my sole job". Sebastian and I talked about what we hope to bring to the community in 2024 to make SREs and SRE teams stronger, happier, and healthier at their work.
Tue, December 19, 2023
Join Ash Patel and Stephen Townshend for a friendly chat about what they've learned in SRE as 2023 comes toward a wrap!
Tue, December 12, 2023
Ash Patel talks with John Hyland who ran the Ignite Program at New Relic, which is dedicated to developing early career engineers. John shares insights about driving better outcomes for the organization and the early career professionals who join them.
Tue, December 05, 2023
Ash Patel talks with Troy Koss, who is the Director of SRE at Capital One, an early adopter of DevOps and SRE in the banking sector. He shares insights on working in regulated industries like banking and telecom, with his early work experience being at Verizon, a US telecom. Troy shares his thoughts on building stronger SRE individual contributors and emphasizes the importance of education as pivotal to ongoing reliability success.
Mon, November 27, 2023
Ash Patel talks with Rick Boone who is a pioneer in SRE, having been an early AppOps engineer at Facebook and Uber's first SRE hire. He shares amazing stories from those pioneering days. Rick also draws from his experience to share his insights on how to build stronger SRE teams, as well as support effective career progression for individual contributor SREs.
Tue, November 21, 2023
Ash Patel interviews Sreejith Chelchery who is SVP of Delivery and Infrastructure Engineering at Dotdash Meredith. Sreejith shares his journey from programming analyst in Bangalore, India, to now being an executive responsible for platform engineering, DevOps, and SRE at a media giant in New York City. He gives a glimpse into how his team saved his organization over $9 million in cloud computing costs, how they started an internal developer platform well before Backstage was around, and more. Sreejith also sheds light on how changemakers and advocates like SREs can win over business and other non-technical stakeholders.
Tue, November 14, 2023
Ash Patel talks with Nash Seshan, who has supported reliability work in over 5 organizations, including Cisco, eBay, Dropbox, Lyft, Netflix, and Wayfair. He shares his learnings from reliability work at these big brands. Nash also draws from his experience as co-founder of a Y Combinator-funded startup on effective engineering leadership. He also gives his take on issues with ill-conceived automation.
Tue, November 07, 2023
Ash Patel talks with Ivan Merrill of Fiberplane about wrangling the big data that incidents and systems generate through collaborative notebooks. Ivan also touches on how open-source tools like Autometrics enable deeper observability of code by increasing the granularity of data used for incident response and retrospectives.
Tue, October 31, 2023
Ash Patel talks with Adriana Villela (CNCF Ambassador, OpenTelemetry contributor, and senior developer advocate at Lightstep) about the promise of OpenTelemetry for observability teams, as well as the challenges of doing it right. She also touches on engineering leadership topics, recalling her experience as a leader of platform engineering and observability teams.
Tue, October 24, 2023
Ash Patel talks with Robert Ross of FireHydrant about his experience in offering incident management software to SREs and other software incident responders. Highlights include defining the broader concept of reliability, making smarter choices for handling incidents, and more.
Tue, October 17, 2023
Ash Patel interviews Rajesh Reddy N about his experiences as a senior DevOps and SRE individual contributor. Rajesh shares his insights on having systems to minimize alert fatigue, the importance of security in DevOps, and more.
Tue, October 10, 2023
Ash Patel interviews Kyle Forster of RunWhen about his experiences as an ex-Google director helping SREs and running an AI-based company that supports Kubernetes troubleshooting. Their conversation will cover themes like enabling junior SREs, the role of SRE in shift-left, and handling misaligned incentive models in organizations.
Mon, October 02, 2023
In this episode of the SREpath Podcast, Ash Patel interviews two SRE managers from Booking.com, Samuele and Yoann, to gain insights into their experiences and strategies for developing a successful SRE practice within a large organization. Yoann is a senior manager responsible for managing SRE teams and serves as the SRE Craft lead. Samuele is an SRE engineering manager working in the Big Data department and manages a team of eight to nine people. Yoann officially began his journey in SRE in 2017, transitioning from a consultancy role to an engineer focused on reliability. Samuele's background included network engineering and DevOps roles before he joined Booking.com in 2018 as an SRE. Booking.com initially didn't have SREs but started adopting SRE practices in 2017 as they transitioned from a monolithic architecture to microservices. The SRE team at Booking.com grew from around 20-30 members to nearly 200, with various teams handling infrastructure, central roles, and embedded roles with product teams. Learn more about the challenges they faced and tackled by listening to the episode.
Mon, September 11, 2023
Ash Patel interviews Pablo Bouzada about his beliefs on software reliability as a non-SRE leader. They discuss the importance of effective leadership in driving reliability changes in the software system, as well as the challenges of providing reliable service within video streaming giant Viaplay.
Tue, September 05, 2023
We haven't hit hard times, just doing other things for the last 2 months including making plans for more interesting episodes on this podcast!
Thu, July 13, 2023
In this episode, we highlight the importance of engaging with HR partners to establish an effective understanding of the SRE career model. This will allow them to help with recruiting, hiring, and onboarding tailored to the SRE function.
Thu, June 29, 2023
We discuss the need for a framework to guide the development of Site Reliability Engineers (SREs) and drive value for organizations. You will learn about our pillar view of areas like observability and service management, to identify areas for improvement and emphasize the importance of focusing on a few key areas at a time. We also discuss the challenges of hiring experienced SRE practitioners and suggest developing existing employees' skills and capabilities to become effective SREs. A capability view of SRE work can help establish a clear career path for SREs within an organization while aligning with acute organizational goals.
Timestamps for key concepts
* Identifying SRE Pillars [00:00:20] Discussion of the different technology disciplines or practices that SREs can work in, such as observability, release engineering, service management, DevSecOps, performance and capacity engineering, platform engineering, and developer experience.
* Focus Areas for SREs [00:02:27] Importance of focusing on a few areas at a time and diving deep into them to identify and overcome challenges. The speakers discuss their current focus areas, which include observability, release engineering, and service management.
* Developing SRE Practitioners [00:06:00] Discussion of the challenges of hiring experienced SRE practitioners and the suggestion of developing existing employees' skills and capabilities to become effective SREs. The speakers highlight the need for a framework to guide the development of SREs and drive value for the organization.
* Establishing a Career Path for SREs [00:08:52] The speakers discuss the need to establish a career path for SREs within an organization, including developing existing employees' skills and capabilities to become effective SREs and setting proper expectations for each level of SRE.
* Collaborating with Other Departments and Teams [00:11:33] The speakers provide ideas for how SREs can collaborate with other departments and teams, including establishing regular communication channels, forming cross-functional teams, and encouraging knowledge sharing as a community within the organization.
* Reliability as an Organizational Conversation [00:13:20] The speakers emphasize the importance of reliability as an organizational conversation, involving not just engineering but also other partners such as product, care, strategic, and marketing teams, to make products and services for customers reliable.
Thu, June 15, 2023
We discuss throughout this episode the different engagement models for Site Reliability Engineering (SRE) and how to contextualize SRE into an organization's structure. Sebastian Vietz, an experienced SRE practitioner, suggests five different engagement models for SRE and emphasizes the importance of considering the cost associated with each model. The hosts also discuss the different types of SREs that can exist within these engagement models, including SRE champions and unicorns. They stress the importance of considering organizational context when implementing SRE and tease a future episode where they will delve deeper into a framework for identifying the capabilities needed to solve SRE-related problems.
Timestamps of key concepts
* Where and how SRE fits into an organization [00:00:20] We discuss the importance of considering organizational context when implementing SRE and explore different engagement models for SRE.
* Center of Excellence for Reliability Engineering [00:02:14] We discuss the idea of a center of excellence for reliability engineering, where a few practitioners take on an advisory role for the organization.
* Embedded SREs [00:04:14] We discuss the idea of embedding SREs into teams, where each team has an embedded SRE whose focus is to implement reliability engineering principles and best practices.
* Five SRE Engagement Models [00:08:23] We discuss five different engagement models for SRE, including embedded SREs, a center of excellence, and a consulting or ambassador model.
* Types of SREs [00:10:25] We discuss different personas that an SRE can take, including champions, advocates, and unicorns.
* Unicorn SREs [00:13:50] We discuss the rare and sought-after unicorn SREs, who have extensive experience and exposure to different business domains and contexts.
Thu, June 01, 2023
This episode discusses how Site Reliability Engineering (SRE) can be important to organizations. SRE can optimize software operations, reduce costs, support revenue-driving areas, mitigate risks, improve cybersecurity, and enhance customer experiences. We will also cover how to integrate SRE into the organization's culture for continuous improvement and innovation.
Wed, May 17, 2023
In this episode of SREpath, Ash and Sebastian discuss the unnecessary debate surrounding Site Reliability Engineering (SRE), DevOps, and platform engineering. They argue that these disciplines should not be pitted against each other, but rather seen as complementary and able to coexist within an organization. The focus should be on continuous improvement, learning from failures, and making things better. The hosts emphasize that practitioners in all three areas share the common goal of improvement and should collaborate rather than compete. They briefly distinguish SRE as focusing on system reliability and scalability, DevOps on collaboration and automation, and platform engineering on building and maintaining infrastructure. The decision to establish dedicated teams for each discipline depends on the organization's scale and needs. The hosts encourage a context-driven approach, where individuals from diverse backgrounds and skill sets can contribute to the SRE field. Ultimately, the key is to prioritize improvement and learning, regardless of labels or titles.
Thu, May 04, 2023
In this episode of the SREpath podcast, Ash and Sebastian explore what Site Reliability Engineering (SRE) is and how it manifests in a highly functional organization. We also cover the controversial issue of what SRE is not.
Thu, April 20, 2023
Welcome to the first episode of the SREpath podcast! In this episode, we'll introduce you to our podcast hosts and give you their broad-level view of Site Reliability Engineering (SRE). We'll also share some points about how we'll be running future episodes. Whether you're an SRE expert or new to the field, this episode will provide valuable insights into SRE and what you can expect from our podcast series.