Beyond Numbers: Turning Test Coverage into a Culture We Track, Enforce, and Improve

Developer Experience

Engineering

Quality

Beyond Numbers: Turning Test Coverage into a Culture We Track, Enforce, and Improve

Jitesh Badgujar

April 14, 2026

11 min read

At Apollo, 30+ engineering squads ship features daily. Speed was never the problem - coverage was. Every untested file was a silent risk, and our safety net was shrinking every week.

You can't improve what you can't measure. You can't enforce what isn't visible. And you can't sustain what isn't cultural.

We'd seen the mandate approach fail - engineers writing meaningless tests to game numbers, test suites validating incorrect behavior, coverage becoming a checkbox rather than a quality signal. So we went the other way: culture alongside compliance, visibility over mandates, opt-in adoption over top-down enforcement.

Today, our coverage has improved by 50% - not because we chased 100%, but because we built infrastructure that makes testing visible, accessible, and part of the workflow. Here's how we did it, what worked, and what we learned.

🎯 Why Test Coverage Matters

Quality is one of the hardest things to measure in engineering. Bug counts are lagging indicators. Incident rates are reactive. Code review quality is subjective. So when you're running a platform used by go-to-market teams worldwide with dozens of teams shipping daily, how do you know if your codebase is getting more resilient or more fragile?

Test coverage isn't a perfect metric, but it answers a question nothing else does: which parts of your codebase have zero safety net? No alert, no failing spec, nothing to catch a regression when someone touches an untested file. When we finally looked at ours, the picture wasn't pretty - frontend unit test coverage at Apollo was 24%, and falling.

📊 What We Track

We track coverage at two levels: overall platform coverage and team-specific coverage - so each squad sees exactly how their code is covered, not just a company-wide average.

The metrics break down into:

Frontend
- Unit tests - using Vitest - a fast unit testing framework for Vite-based projects.
- End-to-end tests - using Playwright - browser automation for real user flows.
- End-to-end tests (Chrome extension) - using WebdriverIO - automation for our Chrome extension user flows.
Backend - RSpec coverage.
Coverage Delta (pts) - change over a given period (weekly, monthly, or per sprint)

The delta is the one we watch most closely. A team at 35% trending upward is in better shape than a team at 55% that dropped 4 points. Direction matters more than position.

🛠️ How We Track It

Nightly CI runs generate coverage reports using V8 for Vitest, Istanbul/NYC for Playwright, and SimpleCov for RSpec.

Using CODEOWNERS, we map every file to the team that owns it. Metrics are pushed to Grafana via Prometheus, giving every team their own view - no manual categorization needed

On every PR, we check unit test coverage for newly added files and post a comment flagging any new code without tests. E2E coverage runs on demand. The goal is simple: no new file should land without visibility into whether it's tested.

const codeowners = new Codeowners(process.cwd(), 'CODEOWNERS');

for (const [filePath, coverageData] of Object.entries(coverageReport)) {
  let owners = codeowners.getOwner(relativeFilePath);
  owners.forEach((owner) => {
    // Aggregate coverage by owner
    coverageByCodeowner[owner].totalLines += coverageData.lines.total;
    coverageByCodeowner[owner].coveredLines += coverageData.lines.covered;
    // ... (functions, statements, branches)
  });
}

🖥️ The Grafana Dashboard: Making It Visible

The single biggest unlock wasn't a policy - it was a dashboard.

We built a unified Grafana dashboard fed by a nightly CI run against the latest master. Every morning, every team can see their frontend unit, frontend E2E, and backend coverage - broken down by team, compared to the previous week. Green means coverage improved. Red means it declined. The colors do the talking.

We review this on two cadences: weekly during quality reviews, where we call out declines and celebrate gains, and monthly during engineering metrics reviews for trend analysis and strategic planning. This regular rhythm keeps coverage top-of-mind without making it feel like an audit.

team wise test coverage split

team test coverage

🔍 The Missing Coverage Report: Making It Actionable

A dashboard tells you where you stand. The missing coverage report tells you where to start.

We generate a report that groups every file with zero coverage by team and directory, showing exactly which files have no unit tests, no E2E tests, or both. Engineers can filter by coverage range - 0%, 1-20%, 21-40% - and drill down to specific files and lines.

"Your team is at 45%" is abstract. "These 14 files in your design-system package have zero unit and zero Playwright coverage" is a starting point. Most team improvement efforts began here - pick the most critical uncovered files, write tests, repeat.

frontend test coverage

We also store detailed HTML reports on GCP. Engineers can drill into file-level, line-level coverage. "Lines 47-63 in EmailerCampaignDetails.tsx have zero coverage" helps a lot.

test coverage code

🚦 Catching Gaps at the PR Level

Dashboards show the big picture. The missing coverage report gives you a backlog. But we also needed something at the speed of a pull request.

We built a custom script - no paid tools, just TypeScript on top of the coverage JSON our test runners already produce - that posts a comment on every PR identifying new files with zero coverage, new files with partial coverage, and files where E2E tests provide coverage even without unit tests.

We deliberately accept E2E coverage as a valid signal. Some UI components are hard to test meaningfully with unit tests alone. We'd rather have an E2E test that validates real user behavior than a unit test that mocks everything away.

⛔ From Warnings to Blocking PRs

We didn't start with enforcement. We started with visibility and let the data do the convincing.

Phase 0 - Built coverage calculation for all our test frameworks.
Phase 1 - Added coverage comments to every PR. No blocking, just information. Awareness went up, but behavior change was slow.
Phase 2 - Teams opt in by adding fe_coverage_enforcement: true in their team config. PRs introducing new files with zero coverage get blocked to merge.
Phase 3 - Enforcement enabled for all teams ✅
Phase 4 - Coverage checks for modified files (coming next)

Sample comment: When PR is blocked

Screenshot 2026-02-12 at 4.38.34 PM.png

Unblock
- When the coverage check blocks a PR, the author adds unit/E2E tests for the newly added file(s) and pushes an update. CI re-runs automatically, and once the check is green, the PR can be merged.
Why this matters
- This makes missing tests an automatic PR signal, so teams do not rely on reviewers to notice and ask for tests before merge.

🛡️ P0 First: Not All Code Is Equal

The critical design choice: any coverage above 0% passes. Unit tests, Playwright - we don't prescribe which. Writing some test is infinitely better than writing none.

We tell teams: don't try to cover everything. Start with code that, if it broke, would directly impact users - payment flows, authentication, core workflow like search. Not every file/line needs a test, but critical paths should never ship uncovered.

🚪 Escape Hatches

For code that genuinely doesn't need testing, we have escape hatches - // code-coverage-ignore to exclude specific code, and a disable_coverage_enforcement label on PRs to unblock in urgent situations.

⏱️ Scaling Coverage Without Slowing CI

As coverage increases, CI can get slower and flakier. To keep running the full test suite on every commit, we invested in supporting CI tooling:

Knapsack Pro to balance and parallelize test execution, keeping CI duration under control.
Automated flaky-test reporting to continuously detect flaky tests , automatically create Jira tickets for the owning team, and drive quick fixes - so flakiness doesn’t become accepted noise.
Automated slow-test reporting to proactively identify runtime regressions and enable quick action before CI performance degrades, including automatic Jira ticket creation for the owning team.

🧹 Dead Code Removal

One trick that helped more than expected: we gave teams the ability to identify dead code in their codebase. Removing unused files improved coverage instantly.

In one cleanup pass alone, teams removed 4,000+ lines of unused code and saw roughly a ~5% increase in test coverage.

For backend, we identify dead code using Tombstones by checking what never gets triggered in production over ~30 days. For frontend, we use Knip to detect unused files/exports.

🤖 AI as a Force Multiplier

Our coverage push coincided with our AI tooling rollout. Engineers used AI to generate test boilerplate and produce comprehensive suites faster.

To make this workflow consistent , we standardized:

Skills + AI rules: reusable Vitest and Playwright workflows, guided by guardrails that enforce Apollo repo patterns and testing expectations so generated tests are closer to merge-ready.
Best practices doc: a review checklist and guidelines to ensure AI-generated tests validate real behavior and do not just chase line coverage.

AI lowered the activation energy for writing tests dramatically. But it's not a replacement for thoughtful testing - AI-generated tests still need human review to ensure they validate real behavior, not just achieve line coverage.

🤝 No Magic Number: No Threshold

This was the most debated decision we made.

We have no minimum coverage threshold. Not 80%. Not 50%. Not even 30%. We've seen what threshold-driven cultures produce - engineers writing tests to satisfy a number rather than to protect behavior. Coverage becomes a game to win instead of a quality signal to trust.

We categorized teams into three tiers to frame conversations, not to gatekeep:

Category	Coverage Range	What It Means
🟢 Healthy	>70%	Maintain and celebrate
🟡 Satisfactory	40-70%	Room to grow, not a fire
🔴 Needs Attention	<40%	Let's talk about what's blocking you

📅 Keeping It Alive: The Weekly Rhythm

Tools and dashboards only work if someone's looking at them. During our bi-weekly quality review, we go through every team - red means declining, green means improving. We call out declines, celebrate gains, and identify specific files or components that need attention. Monthly, during engineering metrics reviews, we zoom out for trend analysis and strategic planning.

The rhythm matters more than the rules. Coverage stays top-of-mind not because it's enforced, but because it's talked about - regularly, publicly, and without blame.

🏆 The Results: Coverage as a Leading Indicator

Here's where we landed after six months:

Metric	Before	After
FE Unit Test Coverage	~24%	~50%
Teams with Declining Coverage	60%	8%
Teams Opted Into Enforcement	0	All
PRs Blocked for Coverage	0	~10/week

⭐ Cultural shift: teams moved from 0 opt-ins to every squad opting into coverage enforcement - everyone is in on it and working toward it now.

The numbers matter, but what they represent matters more. Coverage went from something nobody tracked to something teams talk about in weekly reviews, PR comments, and sprint planning.

✨ Impact

Impact on Product Quality
- ~35% reduction in high-severity (Sev-0/Sev-1) production incidents over the last 1 year, reflecting fewer customer-impacting regressions from recent changes.
Engineering Efficiency
- Faster PR review because coverage gaps are highlighted automatically
- More confidence shipping changes (especially in high-risk areas)

💡 What We Learned - Key Takeaways

Visibility beats mandates. The dashboard and PR comments changed more behavior than any policy document ever could.
Direction over position. We celebrate the delta, not the absolute number. A team trending up matters more than a team sitting high.
Low friction, high adoption. Any coverage above 0%, unit or E2E, with escape hatches. Engineers felt supported, not punished.
Weekly cadence is non-negotiable. Without regular reviews, dashboards become wallpaper.
Start with P0. Focusing on critical paths first made the effort manageable and the impact real.
AI accelerates everything. Using AI tooling for test generation lowered the activation energy dramatically.
🚀 What’s Coming Next - Add a new leading indicator: track coverage of critical user journeys so we know real end-to-end workflows are protected, not just lines of code!

🙏 Acknowledgments

This wasn't a solo effort. Thanks to every engineer who wrote tests, reviewed dashboards, and held each other accountable.

Thanks to our engineering managers for making coverage a priority within their teams.

Special thanks to Marcin Natanek for helping build the Grafana dashboards and Umang Galaiya for reviewing the coverage enforcement proposal document and PR, Also, special thanks to Tejas H N and Neil Ongkingco for driving the dead code removal initiative, and to the SDET team for ensuring CI stays fast and doesn’t become a bottleneck.

Huge thanks to our leaders Dzmitry Markovich, Aidan Li and Rachel Vilceus for the leadership buy-in that made this initiative possible.

Ultimately, this work wasn’t about raising a metric - it was about shipping a more reliable product and reducing the regressions customers feel.

We hope sharing our journey helps other teams facing the same challenge.

Engineering

The Email Analyzer That Explains Itself

Apollo's Email Analyzer helps SDRs write better cold emails with instant, actionable feedback - built by using ML to discover rules, not just make predictions.

January 28, 2026

Read article →

Developer Productivity

Engineering

Engineering Leadership

How We Measured AI Tooling Productivity Gain Across 250+ Engineers at Apollo.io

At Apollo.io, we ran a company-wide experiment across 250+ engineers to quantify the real productivity impact of AI-powered developer tooling. This post breaks down the metrics we chose, the methodology behind the experiment, and the hard lessons we learned about measuring engineering productivity without gaming the system.

January 16, 2026

Read article →

Data Pipeline

Engineering

Software

A Reliable Testing and Evaluation Suite for AI Assistant

At Apollo, we engineered a custom AI Assistant Regression Suite to reliably test a multi-agent, multi‑turn AI assistant by converting real production conversations into replayable tests that semantically match intent while verifying side effects.

December 2, 2025

Read article →

Engineering

Software

Intelligent Email Deliverability Debugging

Apollo transformed email deliverability with an AI-powered agent that cuts debugging time from weeks to minutes. Built with the ReACT framework, it systematically analyzes metrics, authentication, bounce logs, and domain setup to pinpoint issues and offer real-time solutions. Continuous evals ensure reliability, while an updated knowledge base keeps the agent current with Apollo’s evolving features. By combining systematic analysis with domain expertise, Apollo turned a complex, frustrating process into a streamlined, AI-assisted experience, helping users maximize inbox reach and ROI faster than ever before.

September 17, 2025

Read article →