Caveat: Yes, yes, almost everything about the interviewing / recruiting process is broken. Sometimes though, you just have to play the hand you’re dealt and settle for minor improvements.

    The 75-minute HMTPS is my proposed minor improvement.

    Hat tip to The Oatmeal

    What is the HMTPS

    It stands for “Hiring Manager Technical Phone Screen.” Since you asked, I’ve been pronouncing it “ham-tips.” It’s the call a candidate will have after their RPS (Recruiter Phone Screen) but before their onsite.

    This combines two calls: the Technical Phone Screen (TPS), a coding exercise that usually happens before the onsite, and the Hiring Manager (HM) call with your would-be manager, which I’ve seen done before an onsite, after it, or not at all.

    So I combine these into one. It takes 75 minutes.

    Why combine the two interviews?

    An ideal interview loop has as few steps as possible and gets to a decision ASAP. Combining these two calls shortens intro-to-offer by ~1 week and reduces candidate drop-off by 5-10%.

    It’s also a lot less work for recruiters playing scheduling battleship1.

    Finally, Hiring Managers will, on average, be better at selling working at the company - it’s kind of their job.

    Why 75 minutes?

    We’re combining a 30-minute call and a 60-minute call, and merging the 15-minute Q&A at the end of each into one.

    TPS (60m)

    • 5m Intros
    • 45m We write some code in Coderpad together
    • 10m Ask me Anything

    HM call (30m)

    • 5m Intros
    • 10m Dig into relevant experience & what candidate wants from next job
    • 15m AMA time.

    HMTPS (75m)

    • 5m Intros
    • 15m Dig into relevant experience & what candidate wants from next job
    • 30m Coderpad
    • 15m AMA time.
    • 10m buffer time (inevitably one of these will go long in an interesting way)

    I’m also more comfortable shortening the ~50-minute technical question into 30 minutes because I’m pretty calibrated on my question, having run it 200+ times at this point, and can get most of the signal I’m looking for within the first 30 minutes.

    I’ve tried doing this call in 60 minutes and it ends up feeling pretty rushed; not to say somebody else couldn’t pull that off, but I’ve appreciated the bit of space. Also, since most candidates don’t schedule in 15-minute increments, we can always go a little long (up to the 90 minute mark) if we need to.

    Why is this good for the Hiring Manager?

    First, it’s easier to schedule (usually towards the end of the day). Second, it usually gives me enough time with the candidate so that I end up being pretty confident about how they’ll do both at the job and on the onsite. I haven’t quantified this yet, but anecdotally I have been surprised by onsite interviewer feedback much more rarely when I do this.

    Why is this good for the candidate?

    It’s one fewer hoop to jump through. Also, whether they get along with me as their future manager - both technically and interpersonally - can and should be a pretty strong determinant of whether they continue with the process. The combined call gives stronger signal, since we are both coding together and talking about work.

    When is this a bad idea?

    This makes the Hiring Manager a bit of a bottleneck in interviewing; once a company gets to the point where you are interviewing for titles like “Senior Software Engineer, Team TBD,” you have to round-robin TPS-es to the rest of your [Phone Screen Team](/2020/12/05/technical-interview-superforcasters.html).

    Also, as the HM I likely have some unreasonable biases (Golang engineers, I’m looking at you), and making me the bottleneck in interviewing exacerbates those. That said, the HM’s bias is going to be applied sooner or later in the interview process, and my take is that the benefits outlined are worth it.

    1. Tuesday at 4? You sunk my Grooming Session! 

    Published: April 01 2021

    Let us prepare to grapple with the ineffable itself, and see if we may not eff it after all.

    – Douglas Adams, Dirk Gently’s Holistic Detective Agency

    The Situation

    “Ugh, the codebase is just such a mess,” my new Tech Lead said. “It’s just cruft on top of cruft, never cleaned up, always ‘after the next release’. No wonder we keep getting bug reports faster than we can fix them.”

    Not what you want to hear as the freshly-appointed Engineering Manager on a critical team. Leadership expects the team to deliver on key new features, but also, there better not be any voluntary churn.

    I went to talk to the Product Manager. “Tech Debt?” he said, “sure, we can tackle some tech debt - but let’s make sure to get some credibility first by hitting our OKRs. It won’t be easy.”

    How did it get this bad?

    Cut to three years earlier. I was a new hire on that very team. My onboarding buddy - let’s call him Buddy - and I bumped into a strange corner of the codebase.

    “Oh weird,” I said. “Should we fix that?”

    “I have a strategy for this that you can use,” Buddy said. “When you run into code that seems off, that feels worth fixing, you write the issue down in a separate text file. Then you go do useful work.”

    “Oh, I see. And eventually, you get back to the text file and fix the issues?”

    “Nope. But at least you’ve written it down.”

    Wikipedia describes learned helplessness as “behavior exhibited by a subject after enduring repeated aversive stimuli beyond their control.” Without support, this is how engineers come to feel about tech debt.

    When I came back to this team as a manager, I reached out to Buddy, who had left years ago. “The code is crap at Airbnb too,” he told me when we caught up, “but at least they pay well and I don’t have to work very hard.”

    So what did you do?

    I joined Airbnb.

    That’s not true. We tackled the tech debt. We shipped leadership’s key features, hit our OKRs, and cleaned up some terrible, long-overdue-for-deletion no-good code. Within 3 months, the team’s attitude about technical debt had begun to turn around.

    Here’s how.

    Tackling Technical Debt In Three Easy Steps

    Guaranteed1.

    Step 1. Empower

    The biggest reason technical debt exists is that engineers have internalized that it’s not their job to fix it. Start-up mantras like “focus” and “let small fires burn” have led to just that - small fires everywhere.

    “Get shit done” is a great mantra, but you still have to clean up after yourself.

    The fix here is cultural. Make it clear that engineers who identify debt and take time to tackle it are appreciated. Celebrate their work to peers. A friend once created a Slack bot that called out any PR that deleted a significant amount of code. Engineers all across the company began striving to get featured.
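    For flavor, here’s roughly what a bot like that could look like. This is a sketch, not my friend’s actual bot: the 300-line threshold, the env var name, and the message are made up, and real GitHub-to-Slack wiring needs signature verification and error handling.

    ```typescript
    // Sketch of a "celebrate deleted code" bot: a GitHub webhook handler that
    // posts to a Slack incoming webhook whenever a merged PR removes far more
    // code than it adds. Threshold and wording are illustrative.

    interface PullRequestEvent {
      action: string;
      pull_request: {
        merged: boolean;
        title: string;
        html_url: string;
        additions: number;
        deletions: number;
        user: { login: string };
      };
    }

    export async function handlePullRequestEvent(event: PullRequestEvent): Promise<void> {
      const pr = event.pull_request;
      if (event.action !== "closed" || !pr.merged) return;

      const netDeleted = pr.deletions - pr.additions;
      if (netDeleted < 300) return; // only celebrate meaningful cleanups

      await fetch(process.env.SLACK_WEBHOOK_URL!, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text: `:broom: ${pr.user.login} deleted ${netDeleted} net lines of code in "${pr.title}" ${pr.html_url}`,
        }),
      });
    }
    ```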

    Now of course, the team does have actual work that needs doing. Empower doesn’t mean “ignore our actual work” - it means, “if you take a Friday to fix something that’s bothering you, I have your back.”

    Step 2. Identify

    If you’re on a team that hasn’t been rigorous about tackling tech debt, there’s probably lots of it and it’s unclear what could even be done. This is fixable.

    Organize a brainstorm with prompts like

    • What tasks take longer than they should?
    • What is the most embarrassing part of our code to explain to new hires?
    • What key pieces of our code have we under-invested in?

    This’ll set you up with a solid initial list for your Tech Debt backlog. For more ideas, run your codebase through a tool like CodeClimate to algorithmically point out the rough spots.

    The first time we ran a brainstorm like this, everybody agreed that a handful of the ideas were so easy and so valuable that we should do them right away. Like, that day. It felt like a breath of fresh air. Things are fixable.

    Encourage folks to add to the backlog anytime they run into annoyances they don’t have time to fix right then and there. In future team retros or brainstorms, identify any tech debt that comes up and add it to the backlog.

    Step 3. Prioritize

    Having a tech debt backlog and ignoring it is worse than having none at all.

    Time to play Product Manager and use ICE to prioritize your tech debt: the impact a fix would have on velocity, your confidence that the fix will actually work, and the ease of making the fix (effort required).
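    If it helps to make the scoring concrete, here’s a minimal sketch of ICE applied to a debt backlog. The 1-10 scales, the multiply-the-three-numbers convention, and the example items are illustrative assumptions, not anything prescribed above.

    ```typescript
    // Minimal ICE scoring sketch for a tech debt backlog.
    interface DebtItem {
      name: string;
      impact: number;      // how much would fixing this speed us up? (1-10)
      confidence: number;  // how sure are we the fix will actually work? (1-10)
      ease: number;        // how easy is the fix? (10 = an afternoon, 1 = a quarter)
    }

    const backlog: DebtItem[] = [
      { name: "Delete the legacy pricing service", impact: 8, confidence: 9, ease: 3 },
      { name: "Fix the flaky integration test", impact: 6, confidence: 8, ease: 9 },
      { name: "Split the God model", impact: 9, confidence: 5, ease: 2 },
    ];

    const iceScore = (d: DebtItem): number => d.impact * d.confidence * d.ease;

    // Highest score first: quick, valuable, likely-to-work fixes float to the top.
    const prioritized = [...backlog].sort((a, b) => iceScore(b) - iceScore(a));
    for (const d of prioritized) {
      console.log(`${iceScore(d)}  ${d.name}`);
    }
    ```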

    This gives you a list of potential projects. Some will take months; others, hours.

    Now they just need to get done. That’ll require buy-in from your Product Manager.

    Getting Buy-in

    When it’s time to have “the talk” with your PM, I’ve found “how often should you clean your room” to be a useful analogy.

    Never cleaning your room is a bad idea and obviously so. Over time it becomes unlivable. This is how our engineers feel. At the same time, if you’re cleaning your room all day every day, that’s not a clean room, that’s excessive and no longer helpful. In moderation, messiness is healthy - it means you’re prioritizing. We don’t need a glistening-clean room, but we do need to do more than nothing. At the end of the day, a clean room is a productive room.

    Tackling Small Debt

    Come together with your Product Manager and agree on a rate at which small debt projects get added to the team’s ticket queue. With a spiel like the above, you can hopefully carve out ~10% of all work for debt, depending on the maturity of the team and the company.

    For ~week-long projects, leverage particular times of year, like Hack Weeks, and pitch high-value cleanups to engineers looking for a fun project.

    Tackling Heavy Debt

    This is where good leadership helps. At this particular company, Engineering leadership had rolled out “Quality OKRs”. Every quarter, each team had to sign up for a meaningful “quality” OKR goal.

    What is a “quality” goal? That was left up to teams, but the gist was: go fix the most painful thing that isn’t already reflected in your business metrics.

    During quarterly planning, we whittled the top three “heavy tech debt” projects into proposals, got buy-in from leadership, then brought the ideas back to the group.

    Since quality projects had been blessed top-down and were indisputable, the PM had air cover to support the work without pushback.

    So what happened?

    Was there still tech debt? Yes. Did it continue to accumulate? Of course. But did it feel inexorable? Not anymore.

    1. Not guaranteed. 

    Published: December 05 2020

    Originally published as a guest blog post on interviewing.io. Thanks Aline!

    “The new VP wants us to double engineering’s headcount in the next six months. If we have a chance in hell to hit the hiring target, you seriously need to reconsider how fussy you’ve become.”

    It’s never good to have a recruiter ask engineers to lower their hiring bar, but he had a point. It can take upwards of 100 engineering hours to hire a single candidate, and we had over 50 engineers to hire. Even with the majority of the team chipping in, engineers would often spend multiple hours a week in interviews. Folks began to complain about interview burnout.

    Also, fewer people were actually getting offers; the onsite pass rate had fallen by almost a third, from ~40% to under 30%. This meant we needed even more interviews for every hire.

    Visnu and I were the early engineers most bothered by the state of our hiring process. We dug in. Within a few months, the onsite pass rate went back up, and interviewing burnout receded.

    We didn’t lower the hiring bar, though. There was a better way.

    Introducing: the Phone Screen Team

    We took the company’s best technical interviewers and organized them into a dedicated Phone Screen Team. No longer would engineers be shuffled between onsite interviews and preliminary phone screens at recruiting coordinators’ whims. The Phone Screen Team specialized in phone screens; everybody else did onsites.

    Why did you think this would be a good idea?

    Honestly, all I wanted at the start was to see whether I was a higher-signal interviewer than my buddy Joe. So I graphed each interviewer’s phone screen pass rate against how their candidates went on to perform at the onsite.

    Joe turned out to be the better interviewer. More importantly, I stumbled onto the fact that a number of engineers doing phone screens performed consistently better across the board: more of their candidates passed the phone screen, and those candidates went on to get offers at a higher rate.

    Sample Data, recreated for Illustrative Purposes.

    These numbers were consistent, quarter over quarter. As we compared the top quartile of phone screeners to everybody else, the difference was stark. Each group included a mix of strict and lenient phone screeners; on average, both groups had a phone screen pass rate of 40%.

    The similarities ended there: the top quartile’s invitees were twice as likely to get an offer after the onsite (50% vs 25%). These results also were consistent across quarters.

    Armed with newfound knowledge of phone screen superforecasters, the obvious move was to have them do all the interviews. In retrospect, it made a ton of sense that some interviewers were “just better” than others.

    A quarter after implementing the new process, the “phone screen to onsite” rate stayed constant, but the “onsite pass rate” climbed from ~30% to ~40%, shaving more than 10 hours-per-hire (footnote 2). Opendoor was still running this process when I left several years later.

    You should too (footnote 3, footnote 4).

    Starting your own Phone Screen Team

    1. Identifying Interviewers (footnote 5)

    Get your Lever or Greenhouse (or ATS of choice) data into an analyzable place, and then quantify how well interviewers perform. There are lots of ways to analyze performance; here’s a simple approach that favors folks who generated lots of offers from as few onsites and phone screens as possible.

    score = (15 * offers - 4 * onsites - phone screens) / phone screens

    You can adjust the constants so that a median interviewer lands around zero. A score of zero or above, then, is good.

    Your query will look something like this:

    Interviewer      Phone Screens   Onsites   Offers   Score
    Accurate Alice   20              5         3        (45 - 20 - 20) / 20 = 0.25
    Friendly Fred    20              9         4        (60 - 36 - 20) / 20 = 0.2
    Strict Sally     20              4         2        (30 - 16 - 20) / 20 = -0.3
    Chaotic Chris    20              10        3        (45 - 40 - 20) / 20 = -0.75
    No Good Nick     20              12        2        (30 - 48 - 20) / 20 = -1.9
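    If your ATS export lands in a script rather than a SQL console, the same calculation is only a few lines. This sketch uses the constants from the formula above; the field names and sample rows are assumptions.

    ```typescript
    // Sketch of the interviewer scoring pass over an ATS export.
    interface InterviewerStats {
      name: string;
      phoneScreens: number;
      onsites: number;
      offers: number;
    }

    // score = (15 * offers - 4 * onsites - phoneScreens) / phoneScreens
    // Tune the constants so a median interviewer lands near zero.
    const score = (s: InterviewerStats): number =>
      (15 * s.offers - 4 * s.onsites - s.phoneScreens) / s.phoneScreens;

    const stats: InterviewerStats[] = [
      { name: "Accurate Alice", phoneScreens: 20, onsites: 5, offers: 3 },
      { name: "No Good Nick", phoneScreens: 20, onsites: 12, offers: 2 },
    ];

    for (const s of [...stats].sort((a, b) => score(b) - score(a))) {
      console.log(`${s.name}: ${score(s).toFixed(2)}`); // Alice: 0.25, Nick: -1.90
    }
    ```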

    Ideally, hires would also be included in the funnel, since a great phone screen experience would make a candidate more likely to join. I tried including them; unfortunately, the numbers get too small and we start running out of statistical predictive power.

    2. Logistics & Scheduling

    Phone Screen interviewers no longer do onsite interviews (except as emergency backfills). The questions they ask are now retired from the onsite interview pool to avoid collisions.

    Ask the engineers to identify and block off four one-hour weekly slots to make available to recruiting (recruiting coordinators will love you). Use a tool like youcanbook.me or calendly to create a unified availability calendar. Aim for no more than ~2.5 interviews per interviewer per week. To minimize burnout, one thing we tried was taking 2 weeks off interviewing every 6 weeks.

    To avoid conflict, ensure that interviewers’ managers are bought into the time commitment, and factor the interviewing work into performance reviews.

    3. Onboarding Interviewers

    When new engineers join the company and start interviewing, they will initially conduct on-site interviews only. If they perform well, consider inviting them into the phone screen team as slots open up. Encourage new members to keep the same question they were already calibrated on, but adapt it to the phone format as needed. In general, it helps to make the question easier and shorter than if you were conducting the interview in person.

    When onboarding a new engineer onto the team, have them shadow a current member twice, then be reverse-shadowed by that member twice. Discuss and offer feedback after each shadowing.

    4. Continuous Improvement

    Interviewing can get repetitive and lonely. Fight this head-on by having recruiting coordinators add a second interviewer (not necessarily from the team) to 10% or so of interviews, and have the pair discuss afterwards.

    Hold a monthly retrospective with the team and recruiting, with three items on the agenda:

    • discuss potential improvements to the interviewing process
    • review borderline interviews together as a group, if your interviewing tool supports recording and playback
    • have interviewers read through feedback their candidates got from onsite interviewers and look for consistent patterns.

    5. Retention

    Eventually, interviewers may get burnt out and say things like “I’m interviewing way more people than others on my actual team - why? I could just go do onsite interviews.” This probably means it’s time to rotate them out. Six months feels about right for a typical “phone screen team” tour of duty, to give people a rest. Some folks may not mind and stay on the team for longer.

    Buy exclusive swag for team members. Swag is cheap, and these people are doing incredibly valuable work. Leaderboards (“Sarah interviewed 10 of the new hires this year”) help raise awareness. Appreciation goes a long way.

    Also, people want to be on teams with cool names. Come up with a cooler name than “Phone Screen Team.” My best idea so far is “Ambassadors.”

    Conclusion

    There’s something very Dunder Mifflin about companies that create Growth Engineering organizations to micro-optimize conversion, only to have those very growth engineers struggle to focus due to interview thrash from an inefficient hiring process. These companies invest millions into hiring, coaching, and retaining the very best salespeople. Then they leave recruiting - selling the idea of working at the company - in the hands of an engineer who hasn’t gotten a lick of feedback on their interviewing since joining two years ago, and who has a tight project deadline in the back of their mind.

    If you accept the simple truth that not all interviewers are created equal, believe that the same rigorous quantitative process you use to improve the business should also improve your internal operations, and are trying to hire quickly, you should consider creating a Technical Phone Screen Team.

    FAQs, Caveats, and Pre-emptive Defensiveness

    1. Was this statistically significant, or are you conducting pseudoscience? Definitely pseudoscience. Folks in the sample were conducting about 10 interviews a month, ~25 per quarter. Perhaps not yet ready to publish in Nature but meaningful enough to infer from, especially considering the relatively low cost of being wrong.
    2. Why didn’t the on-site pass rate double, as predicted? First, not all of the top folks ended up joining the team. Second, the best performers did well because of a combination of skill (great interviewers, friendly, high signal) and luck (got better candidates). Luck is fleeting, resulting in a regression to the mean.
    3. What size does this start to make sense at? Early on, you should just identify who you believe your best interviewers are and have them (or yourself) do all the phone screens. Then, once you start hiring rapidly enough that you are doing about 5-10 phone screens a week, run the numbers and invite your best 2-3 onsite interviewers to join and create the team.
    4. What did you do for specialized engineering roles? They had their own dedicated processes. Data Science ran a take home, Front-End engineers had their own Phone Screen sub-team, and Data and ML Engineers went through the general full-stack engineer phone screen.
    5. Didn’t shrinking your Phone Screener pool hurt your diversity? In fact, the opposite happened. First, the phone screener pool had a higher percentage of women than the engineering organization at the time; second, a common interviewing anti-pattern is “hazing” - asking difficult questions and then rejecting somebody for “not even remembering about Kahn’s algorithm, lolz.” The best phone screeners don’t haze, bringing a more diverse group onsite.

    I’m here to warn you about the dangers of front-end user tracking. Not because Google is tracking you, but because it doesn’t track you quite well enough.

    What follows is a story in three parts: the front-end tracking trap I fell into, how we dug ourselves out, and how you can go around the trap altogether.

    Part 1: A Cautionary Tale

    The year was 2019. Opendoor was signing my paychecks.

    We were launching our shiny new homepage.

    We had spent a month migrating our landing pages from the Rails monolith to a shiny new Next.JS app. The new site was way faster and would therefore convert better, saving us millions of dollars annually in Facebook and Google ad costs.

    Being responsible, we ran the roll-out as an A/B test, sending half of the traffic to the old site so we could quantify our impact1.

    The impact: we’d made things worse. Way worse. The new site got crushed.

    What happened?

    WTF. Google had told us our new page was way better. The new site even felt snappier.

    “Figure it out.” The engineers on the revamp paired up with a Data Scientist and went to figure out what the hell was going on. They started digging into every nook and cranny of the relaunch.

    A week went by. Our director peeked in curiously. Murmurs about postponing the big launch started to circle. Weight was gained; hair was lost.

    Ultimately, the clue that cracked the case was bounces. Bounces (IE, people leaving right away) were way up on the new site. But it was clear the new site loaded much faster. Bounce rates should have gone down, not up.

    How did we measure bounce rates? We dug in.

    How bounces work

    When the homepage loads, the front-end tracking code records a ‘page view’ event. If the ‘page view’ event was recorded, but then nothing else happens, analytics will consider that user to have “bounced”.

    It turned out that the old site was so slow that many folks left before their ‘page view’ ever got recorded. In other words, the old site was dramatically under-reporting bounces.

    It was like comparing two diet plans and saying the one where half the subjects quit was better because the survivors tended to lose weight.

    Part 2: How we fixed bounces

    If the front-end was under-reporting bounces, could we find a way to track a ‘page view’ without relying on the client?

    There was. It was on the server - though in our case, we tracked the event in Cloudflare, which we were already using for our A/B test setup.

    We started logging a page-about-to-be-viewed event instead of the page view event - which was really a page-was-viewed-long-enough-for-the-tracking-JavaScript-to-load event. We updated our bounce metric calculation accordingly.
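    Here’s a minimal sketch of the idea, assuming a Cloudflare Worker proxying the landing pages and Segment’s HTTP tracking API. The cookie handling, env binding, and event naming are illustrative - not our exact setup.

    ```typescript
    // Sketch: record the page view at the edge, before any client-side JS has a
    // chance to (not) load. Assumes a Cloudflare Worker in front of the landing
    // pages and Segment's HTTP tracking API.

    interface Env {
      SEGMENT_WRITE_KEY: string;
    }

    export default {
      async fetch(
        request: Request,
        env: Env,
        ctx: { waitUntil(promise: Promise<unknown>): void }
      ): Promise<Response> {
        const url = new URL(request.url);

        // Reuse Segment's anonymous id cookie if present; otherwise mint one at the edge.
        const cookies = request.headers.get("Cookie") ?? "";
        const existing = cookies.match(/ajs_anonymous_id=([^;]+)/)?.[1];
        const anonymousId = existing
          ? decodeURIComponent(existing).replace(/"/g, "")
          : crypto.randomUUID();

        // Fire the "page about to be viewed" event without slowing down the response.
        ctx.waitUntil(
          fetch("https://api.segment.io/v1/page", {
            method: "POST",
            headers: {
              "Content-Type": "application/json",
              Authorization: "Basic " + btoa(env.SEGMENT_WRITE_KEY + ":"),
            },
            body: JSON.stringify({
              anonymousId,
              name: url.pathname,
              context: { userAgent: request.headers.get("User-Agent") },
            }),
          })
        );

        // Serve the page as usual; make sure first-time visitors keep the same id.
        const originResponse = await fetch(request);
        const response = new Response(originResponse.body, originResponse);
        if (!existing) {
          response.headers.append(
            "Set-Cookie",
            `ajs_anonymous_id=${anonymousId}; Path=/; Max-Age=31536000`
          );
        }
        return response;
      },
    };
    ```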

    Lo and behold, the new infra was better after all! We had been giving our old page too much credit this entire time, but nobody was incentivized to cry wolf.

    Part 3: Front-end tracking done right

    Forsake the front-end. Tis a terrible place to track things, for at least three reasons.

    1. Performance

    The less JavaScript (especially third-party) you have on your landing pages, the better. It’s a better customer experience, and it improves your page’s conversion and quality score.

    We calculated that getting rid of Segment and Google Tag Manager on our landing pages would yield about 10-15 points of Google PageSpeed. Google takes PageSpeed into account for Quality Score, which in turn makes your CPMs/CPC cheaper.

    2. Fidelity

    Somewhere between a quarter and a half of all users have ad-blockers set up. If you’re relying on a pixel event to inform Google / Facebook of conversions, you’re not telling them about everybody. This makes it harder for their machine learning to optimize which customers to send your way. Which means you’re paying more for the same traffic.

    3. Powerlessness

    You want to believe that you have control of the JavaScript running on your page, but how many browser extensions does the user have? How much has actually loaded? Wait, what version of IE is this person on?

    What should I do instead?

    Take all your client-side tracking, and move it

    • to the edge for things like page views (though the server is fine here if you KISS)
    • to the server for events that have consequences, like button presses
    • to the publishers for paid traffic conversions: inform Google/Facebook via their server-side APIs when feasible, instead of trying to load a pixel (rough sketch below)
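    For that last bullet, here’s a rough sketch of what a server-side conversion ping can look like, using Meta’s Conversions API as the example. The API version, event name, and field choices are assumptions - check the current docs before copying.

    ```typescript
    // Sketch: tell the ad platform about a conversion from the server instead of
    // relying on a client-side pixel. Modeled on Meta's Conversions API; version,
    // pixel id handling, and fields are illustrative.

    import { createHash } from "node:crypto";

    const sha256 = (value: string): string =>
      createHash("sha256").update(value.trim().toLowerCase()).digest("hex");

    export async function reportConversion(opts: {
      pixelId: string;
      accessToken: string;
      email: string;
      ip: string;
      userAgent: string;
      url: string;
    }): Promise<void> {
      await fetch(`https://graph.facebook.com/v19.0/${opts.pixelId}/events`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          access_token: opts.accessToken,
          data: [
            {
              event_name: "Lead",
              event_time: Math.floor(Date.now() / 1000),
              action_source: "website",
              event_source_url: opts.url,
              user_data: {
                em: [sha256(opts.email)], // user identifiers are sent hashed
                client_ip_address: opts.ip,
                client_user_agent: opts.userAgent,
              },
            },
          ],
        }),
      });
    }
    ```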

    FAQs & Caveats

    Won’t this break how we identify anonymous users?

    Shouldn’t. We used Segment to identify anonymous users; the change was just calling .identify() in Cloudflare (and handling the user cookie there).

    I heard server-side conversion tracking for Google and Facebook doesn’t perform as well.

    I’ve heard (and experienced) this as well. We’re entering black magic territory here… try it.

    The End.

    Want to tell me I’m misinformed / on-point / needed? Hit me up.


    1. We explicitly only changed the infra which served our landing pages, and kept the content - the HTML/CSS/JS - identical. Once the new infra was shown to work, we would begin to experiment with the website itself. 

    “Alexey, do you feel the points you bring up during our post-mortems are productive?” my tech lead asked at our 1:1.

    Well, shit. I had thought so, but apparently not.

    Earlier in the year, I became the Engineering Manager on a team responsible for half of the outages at our 2,000-person company. After each incident, the on-call engineer would write up a doc and schedule a meeting.

    “How come this wasn’t caught in unit tests?” I found myself asking, in front of the assembled team. Next post-mortem, same thing. “I get that we didn’t have monitoring for this particular metric, but why not?” Week after week.

    The tech lead had asked a great question. Was my approach working?

    “I want to set high expectations,” I told him. “It’s not pleasant being critiqued in a group setting, but my hope is that the team internalizes my ‘good post-mortem’ bar.”

    The words sounded wrong even as I said them.

    “Thanks for the feedback,” I said. “Let me think on it.”

    Feedback budgets

    I thought about it.

    There’s a limited budget for criticism one can ingest productively in a single sitting. Managers will try to extend this budget through famed best-practices like the shit-sandwich and the not-really-a-question question. Employees learn these approaches over time and develop an immunity.

    This happened here. Once my questioning reached the criticism threshold, I was no longer “improving the post-mortem culture.” I was “building resentment and defensiveness”.

    I had run over budget. And yet, there was important feedback to give!

    Change the template, change the world

    Upon reflection, I ended up updating our post-mortem template. My questions became part of the template that got filled in before the meeting.

    This way, it was the template pestering the post-mortem author. My role was simply to insist that the template be filled out - an entirely reasonable ask.

    Surprisingly enough, this worked; post-mortems became more substantive. The team pared down outage frequency and met OKR goals.

    Process linters

    The One Simple Trick I had stumbled into was a way around feedback budgets. It turns out there’s another, vaster budget to tap into: the budget for process automation. When feedback is automated, it arrives sooner, feels confidential, and lacks judgement. This makes it palatable; this is why the budget is vaster.

    The technical analogy here is how we use linters. “Nit: don’t forget to explicitly handle the return value” during code review feels mildly frustrating. Ugh. It’s “just a style thing” and “the code works”. I’ll make the change, but with slight resentment.

    Yet, if that same “unhandled return value” nudge arrives in the form of a linter, it’s a different story. I got the feedback before submitting the code for review; no human had to see my minor incompetence.
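    To ground the analogy, here’s a small illustrative example in TypeScript, leaning on the @typescript-eslint/no-floating-promises rule as the stand-in for the “handle the return value” nudge:

    ```typescript
    // With a rule like @typescript-eslint/no-floating-promises enabled, the first
    // call below fails the lint step locally - nobody has to point it out in review.

    async function saveUser(name: string): Promise<boolean> {
      // ...imagine a real database write here
      return name.length > 0;
    }

    export async function onSignup(name: string): Promise<void> {
      saveUser(name); // lint error: the returned promise (and result) is ignored

      const ok = await saveUser(name); // handled: result awaited and checked
      if (!ok) {
        throw new Error("failed to save user");
      }
    }
    ```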

    For a software engineer, “Have Good Linters” is an obvious, uncontroversial best practice. The revelatory moment for me was realizing that document templates are just another kind of linter.

    Happy Ending

    My insight completely transformed the way Opendoor Engineering thinks about feedback; I crowd-surfed, held aloft by the team’s grateful arms, to receive my due praise as the master of all process improvement.

    Just kidding; COVID-19 happened and I switched jobs.


    The Appendices Three

    I: Process linters seen in the wild

    Meetings

    Feedback: “we have too many meetings”; “what’s the point of this meeting”; “do I need to be here”
    Linter: mandate no-meetings days; mandate agendas; mandate a hard max on attendee count.

    Progress Updates

    Feedback: “Hey, how’s that project going? Haven’t heard from you in a bit”
    Linter: daily stand-ups (synchronous or in Slack/an app); issue trackers (Linear, Asana, Jira, Trello)

    Bug Reports

    Feedback: “Hey, a friend who uses the app said that our unsubscribe page is broken?”
    Linter: quality pre-deploy test coverage; automated error reporting (Sentry); alerting on pages or business metrics with anomalous activity patterns (Datadog).

    II: You’ve gone too far with this process crap

    The process budget is vaster than the feedback budget, but it isn’t unlimited. A mature company is going to have lots of legacy process - process debt, if you will.

    Process requires maintenance and pruning, to avoid “we do this because we’ve always done this” type problems. High-process managers are just as likely to generate unhappy employees as high-feedback managers.

    III: The post-mortem template changes, if that’s what you’re here for

    A. “5 Whys” Prompts

    Our original 5 Whys prompt was “Why did this outage occur?” During the post-mortem review, I kept asking questions like “but why didn’t this get caught in regression testing?”

    So, after discussion, I added my evergreen questions to the post-mortem template. They are:

    • Why didn’t the issue get caught by unit tests?
    • Why didn’t the issue get caught by integration/smoke tests?
    • Why didn’t the issue get flagged in Code Review?
    • Why didn’t the issue get caught during manual QA?
    • If the outage took over an hour to get discovered, why didn’t the monitoring page our on-call?

    B. Defining “Root Cause”

    “5 Whys” recommends continuing to ask why until you’re about five levels deep. We were often stopping at one or two.

    To make stopping less ambiguous, here are a set of “root causes” that I think are close to exhaustive:

    • trade-off: we were aware of this concern but explicitly made the speed-vs-quality trade-off (IE, not adding tests for an experiment). This was tech debt coming back to bite us.
    • knowledge gap: the person doing the work was not aware that this kind of error was even possible (IE, tricky race conditions, worker starvation)
    • brain fart: now that we look at it, we should have caught this earlier. “Just didn’t get enough sleep that night” kind of thing.

      If you keep asking “why” but haven’t gotten to an answer that boils down to one of these, keep going deeper (or get a second opinion).