New Hires: Learn How The System Breaks

We have previously discussed two ideas related to onboarding and finding impact.

The premise of Optimize Onboarding is that you should start doing real work right away when you join a new company.

The premise of Build Your Career on Dirty Work is that you should find the unsavory streams of work in an organization because they’re big opportunities.

Here we marry these two concepts: When you start a new role on a software team, you should immediately immerse yourself in the tooling and data on how things fail.

Most software teams have the following:

  • Incidents and incident data
  • Bugs and bug reports
  • On-call rotations and pages
  • Observability tooling (e.g. Datadog, New Relic, Honeycomb)
  • Release/revert tooling
  • Internal tooling and support tooling

These processes, data, and tools are invaluable for seeing where your system fails and learning how it gets fixed. Your system’s failure ecosystem is valuable for three reasons.

First, onboarding into a mature system can often be an extremely daunting, many-month (or multi-year) process. This is true for both managers and ICs. Failure streams are a shortcut to understanding the system, because failures are where the system is interesting and nuanced. Failures are where the complexity, entropy, and flux in the system live. Everything that doesn’t fail behaves like the architecture diagram. Failures show where the architecture isn’t working as intended. By focusing on failures, engineers can onboard quickly into the most important part of the system: the part with problems.

Second, failures are where opportunity lives. Technical visions, for example, are ultimately a way of addressing failures; understanding failures is a warp-speed button for thoughtfully crafting a technical vision.

In one of my early jobs I spent 9 months trying to find opportunities to improve things. I worked philosophically, from the features down. Around every corner I would find “we already optimize for that” or “it’s not really broken, probably not a priority”.

In reality, a lot of things were breaking in the system; I just didn’t know about them. If I had tapped into the stream of failures, I would have been thinking about the actual issues that were occurring. That approach would have narrowed the space of consideration from millions of lines of code to hundreds of lines, and from hundreds of services to a few.
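
To make that narrowing concrete, here is a minimal sketch of what tapping into the failure stream might look like. It assumes you can export incidents from whatever tracker your team uses as a list of records with a service field; the record shape, the sample data, and the function name are all hypothetical, not any particular tool's format.

```python
from collections import Counter

# Hypothetical incident export. In practice this would come from your
# team's incident tracker; the field names here are assumptions.
INCIDENTS = [
    {"id": "INC-1", "service": "checkout"},
    {"id": "INC-2", "service": "search"},
    {"id": "INC-3", "service": "checkout"},
    {"id": "INC-4", "service": "auth"},
    {"id": "INC-5", "service": "checkout"},
    {"id": "INC-6", "service": "search"},
    {"id": "INC-7", "service": "billing"},
]


def rank_services_by_incidents(incidents, top_n=3):
    """Return the top_n (service, incident_count) pairs, most incidents first."""
    counts = Counter(incident["service"] for incident in incidents)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # A short ranked list is the "few services" worth reading first.
    for service, count in rank_services_by_incidents(INCIDENTS):
        print(f"{service}: {count} incidents")
```

Even a crude tally like this points a new hire at a handful of services instead of the whole codebase.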

Finally, the tooling to detect and triage failures is often the most critical tool set in both the manager and IC toolbox. Knowing how to navigate your team’s observability software and release/revert mechanism is a superpower. Not only does learning the ins and outs of your team’s Datadog, New Relic, or Honeycomb setup give you a much deeper understanding of the system, it also gives you a language for querying and investigating the system whenever you have questions or hypotheses.

Here are some very specific recommendations for your first month at a new job:

  • Learn the observability tooling and spend 20 minutes a day either investigating questions about the system (e.g. what’s the slowest thing?) or trying to debug active issues.
  • Shadow as many incidents as possible. Join the Zoom calls where people are triaging. Join the weekly incident review. Do this forever.
  • Learn where bug and page data lives. Go to bug triage and on-call handoff if they exist.
  • Shadow an on-call rotation for a week (during business hours).
  • Learn about the release/revert mechanism and do hands-on release work.
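
As an illustration of the first recommendation, a question like "what’s the slowest thing?" can be answered with very little code once you have request timings in hand. The sketch below assumes you have exported (endpoint, duration in milliseconds) pairs from your observability tool; the sample data and function names are hypothetical, and a real setup would usually answer this with the tool's own query language instead.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical export of (endpoint, duration_ms) pairs. Real data would
# come from your team's observability tooling.
REQUESTS = [
    ("/search", 100), ("/search", 120), ("/search", 130),
    ("/checkout", 300), ("/checkout", 350), ("/checkout", 2000),
    ("/login", 50), ("/login", 60),
]


def p95_by_endpoint(requests):
    """Return a dict mapping each endpoint to its 95th percentile latency."""
    durations = defaultdict(list)
    for endpoint, duration_ms in requests:
        durations[endpoint].append(duration_ms)

    result = {}
    for endpoint, values in durations.items():
        if len(values) < 2:
            # A percentile of a single sample is just that sample.
            result[endpoint] = float(values[0])
        else:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            # "inclusive" keeps the estimate within the observed range.
            result[endpoint] = quantiles(values, n=20, method="inclusive")[-1]
    return result


def slowest_endpoint(requests):
    """Return the endpoint with the highest p95 latency."""
    p95 = p95_by_endpoint(requests)
    return max(p95, key=p95.get)


if __name__ == "__main__":
    print("Slowest endpoint by p95:", slowest_endpoint(REQUESTS))
```

Ranking by a high percentile rather than the average is a deliberate choice here: the tail is usually where users feel the pain and where incidents start.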

Note: While much of this advice is engineering-focused, most of it is also entirely doable by a somewhat technical PM. In fact, if you’re a somewhat technical PM and want to become a titan, I think these are high-value ways to get there.

Also note: a lot of managers and teams instinctively do the exact opposite of this advice. They shield new people from problems to make things more pleasant and to avoid scaring them. This is a classic case of good intentions working against people. You hire people to solve problems, so show them your problems right away.