“Alexey, do you feel the points you bring up during our post-mortems are productive?” my tech lead asked at our 1:1.
Well, shit. I had thought so, but apparently not.
Earlier in the year, I became the Engineering Manager on a team responsible for half of the outages at our 2,000-person company. After each incident, the on-call engineer would write up a doc and schedule a meeting.
“How come this wasn’t caught in unit tests?” I found myself asking, in front of the assembled team. Next post-mortem, same thing. “I get that we didn’t have monitoring for this particular metric, but why not?” Week after week.
The tech lead had asked a great question. Was my approach working?
“I want to set high expectations,” I told him. “It’s not pleasant being critiqued in a group setting, but my hope is that the team internalizes my ‘good post-mortem’ bar.”
The words sounded wrong even as I said them.
“Thanks for the feedback,” I said. “Let me think on it.”
Feedback budgets
I thought about it.
There’s a limited budget for criticism one can ingest productively in a single sitting. Managers will try to extend this budget through famed best-practices like the shit-sandwich and the not-really-a-question question. Employees learn these approaches over time and develop an immunity.
This happened here. Once my questioning reached the criticism threshold, I was no longer “improving the post-mortem culture.” I was “building resentment and defensiveness.”
I had run over budget. And yet, there was important feedback to give!
Change the template, change the world
Upon reflection, I ended up updating our post-mortem template. My questions became part of the template that got filled in before the meeting.
This way, it was the template pestering the post-mortem author. My role was simply to insist that the template be filled out; an entirely reasonable ask.
Surprisingly enough, this worked; post-mortems became more substantive. The team pared down outage frequency and met OKR goals.
Process linters
One Simple Trick I had stumbled into was that there was a way to get around feedback budgets. Turns out there’s this other, vaster budget to tap into: the budget of process automation. When feedback is automated, it arrives sooner, feels confidential, and lacks judgement. This makes it palatable; this is why the budget is vaster.
The technical analogy here is how we use linters. “Nit: don’t forget to explicitly handle the return value” during code review feels mildly frustrating. Ugh. It’s “just a style thing” and “the code works”. I’ll make the change, but with slight resentment.
Yet, if that same “unhandled return value” nudge arrives in the form of a linter, it’s a different story. I got the feedback before submitting the code for review; no human had to see my minor incompetence.
To a software engineer, Have Good Linters is an obvious, uncontroversial best practice. The revelatory moment for me was that templates for documents were just another kind of linter.
Happy Ending
My insight completely transformed the way Opendoor Engineering thinks about feedback; I crowd-surfed, held aloft by the team’s grateful arms, to receive my due praise as the master of all process improvement.
Just kidding; COVID-19 happened and I switched jobs.
The Appendices Three
I: Process linters seen in the wild
Meetings
Feedback: “we have too many meetings”; “what’s the point of this meeting”; “do I need to be here”
Linter: mandate no-meetings days; mandate agendas; mandate a hard max on attendee count.
Progress Updates
Feedback: “Hey, how’s that project going? Haven’t heard from you in a bit”
Linter: daily stand-ups (synchronous or in Slack/an app); issue trackers (Linear, Asana, Jira, Trello).
Bug Reports
Feedback: “Hey, a friend who uses the app said that our unsubscribe page is broken?”
Linter: quality pre-deploy test coverage; automated error reporting (Sentry); alerting on pages or business metrics with anomalous activity patterns (Datadog).
II: You’ve gone too far with this process crap
The process budget is vaster than the feedback budget, but it isnât unlimited. A mature company is going to have lots of legacy process - process debt, if you will.
Process requires maintenance and pruning, to avoid “we do this because we’ve always done this” type problems. High-process managers are just as likely to generate unhappy employees as high-feedback managers.
III: The post-mortem template changes, if that’s what you’re here for
A. “5 Whys” Prompts
Our original 5 Whys prompt was “Why did this outage occur.” During the post-mortem review, I kept asking questions like “but why didn’t this get caught in regression testing?”
So, after discussion, I added my evergreen questions to the post-mortem template. They are:
- Why didn’t the issue get caught by unit tests?
- Why didn’t the issue get caught by integration/smoke tests?
- Why didn’t the issue get flagged in Code Review?
- Why didn’t the issue get caught during manual QA?
- If the outage took over an hour to get discovered, why didn’t the monitoring page our on-call?
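Taking the "template as linter" idea literally, the "insist that the template be filled out" step could itself be automated. Below is a minimal sketch, assuming post-mortems are plain-text docs where each prompt sits on its own line with the answer on the lines that follow. The function name `lint_postmortem` and the doc layout are hypothetical illustrations, not something described in this post. The fifth, conditional question is left out because checking it would require knowing the outage duration.

```python
# Hypothetical sketch: flag post-mortem docs that skip a required prompt.
REQUIRED_PROMPTS = [
    "Why didn't the issue get caught by unit tests?",
    "Why didn't the issue get caught by integration/smoke tests?",
    "Why didn't the issue get flagged in Code Review?",
    "Why didn't the issue get caught during manual QA?",
]


def lint_postmortem(doc: str) -> list[str]:
    """Return a list of complaints; an empty list means the doc passes."""
    # Normalize curly apostrophes so pasted text still matches the prompts.
    normalized = doc.replace("\u2019", "'")
    lines = [line.strip() for line in normalized.splitlines()]
    complaints = []
    for prompt in REQUIRED_PROMPTS:
        if prompt not in lines:
            complaints.append(f"missing prompt: {prompt}")
            continue
        # The answer is every non-empty line up to the next prompt (or end of doc).
        answer = []
        for line in lines[lines.index(prompt) + 1:]:
            if line in REQUIRED_PROMPTS:
                break
            if line:
                answer.append(line)
        if not answer:
            complaints.append(f"unanswered: {prompt}")
    return complaints
```

Run before the meeting is scheduled, a check like this delivers the "you skipped a question" nudge privately, the same way a code linter does.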
B. Defining “Root Cause”
“5 Whys” recommends continuing to ask why until you’re about five levels deep. We were often stopping at one or two.
To make stopping less ambiguous, here is a set of “root causes” that I think is close to exhaustive:
- trade-off: we were aware of this concern but explicitly made the speed-vs-quality trade-off (e.g., not adding tests for an experiment). This was tech debt coming back to bite us.
- knowledge gap: the person doing the work was not aware that this kind of error was even possible (e.g., tricky race conditions, worker starvation).
- brain fart: now that we look at it, we should have caught this earlier. “Just didn’t get enough sleep that night” kind of thing.
If you keep asking “why” but haven’t gotten to an answer that boils down to one of these, keep going deeper (or get a second opinion).
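The stopping rule above can be written down as a tiny check. This is a sketch only: deciding which category an answer falls into is still human judgment, and the names `RootCause` and `needs_more_whys` are hypothetical, introduced here for illustration.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    TRADE_OFF = "trade-off"
    KNOWLEDGE_GAP = "knowledge gap"
    BRAIN_FART = "brain fart"


def needs_more_whys(chain: list[tuple[str, Optional[RootCause]]]) -> bool:
    """True if the deepest answer has not been tagged with a root cause yet."""
    if not chain:
        return True
    _answer, tag = chain[-1]
    return tag is None


# Example chain: each entry is (answer to "why?", root-cause tag or None).
chain = [
    ("The deploy shipped a bad migration.", None),
    ("We skipped tests for the experiment to ship faster.", RootCause.TRADE_OFF),
]
print(needs_more_whys(chain))
```

A chain whose last answer carries no tag is the signal to ask "why" again or to pull in a second opinion.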
Tags: #engineering-management