Reliability engineering
Nightly medical-service job stabilization
A nightly appointment-processing job looked healthy in production but silently left recoverable billable services behind. I rebuilt the run around bounded batches, caching, fault isolation, and categorized operational reporting.
Key metrics
- Recovered services: about 5,300
- Failure boundary: per appointment
- Run shape: windowed and chunked
Problem
A nightly appointment-processing job was responsible for finding recent clinical appointments, checking which ones were still missing billable medical services, mapping those appointments to the right service blocks, and writing the generated services back into the hospital billing workflow.
The job looked straightforward on paper. In production, it was quietly leaving recoverable work behind. Some appointments were skipped because the run operated on too much data at once. Others failed because one invalid service block or missing configuration could interrupt otherwise valid appointments.
I investigated the missing services end to end, then refactored the nightly job around batching, caching, fault isolation, and operational visibility. The stabilized job helped recover approximately 5,300 missing billable medical services and made future backfills and nightly runs much safer to operate.
Domain names, identifiers, and internal implementation details are generalized here, while the architecture and engineering decisions are preserved.
Before
The core issue was the shape of the pipeline: it behaved like one long lane of work, where data volume, repeated lookups, and error handling all compounded each other.
After
The refactor changed the job from a fragile bulk process into a bounded, observable recovery pipeline. Each stage now has a smaller responsibility and a clearer failure boundary.
Windowed appointment processing
Instead of processing the full appointment range as one large unit, the job splits the date range into configurable windows. Each window loads appointments, checks for existing services, generates missing services, and then moves to the next window.
This keeps memory usage and query size bounded. It also makes backfills easier because historical ranges can be processed in predictable slices instead of turning every recovery run into one oversized production event.
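The windowing step can be sketched roughly like this. This is a minimal illustration, not the production code; the function name and shape are hypothetical.

```typescript
// Hypothetical sketch: split a date range into fixed-size windows so each
// batch of appointments stays bounded. Names are illustrative.
interface DateWindow {
  start: Date;
  end: Date;
}

function splitIntoWindows(start: Date, end: Date, windowDays: number): DateWindow[] {
  const windows: DateWindow[] = [];
  const msPerDay = 24 * 60 * 60 * 1000;
  let cursor = start;
  while (cursor < end) {
    // Cap the last window at the overall end date.
    const next = new Date(Math.min(cursor.getTime() + windowDays * msPerDay, end.getTime()));
    windows.push({ start: cursor, end: next });
    cursor = next;
  }
  return windows;
}
```

Each window is then processed to completion (load, check, generate, write) before the next one starts, so a backfill over months of history is just more windows, not a bigger query.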
Run-level caching
Several expensive values are reused across appointments during one run: service configuration, cost center data, responsible doctor settings, case type, and service block metadata.
The refactor introduced run-level caches around these lookups. Identical requests reuse the same promise, and failed requests are removed from the cache so later appointments can retry instead of inheriting a poisoned cache entry.
Fault isolation
The most important reliability change was separating preparation failures from write failures and handling both at the appointment level.
Preparation includes loading cost centers, responsible physicians, case information, and service blocks. Writing is the final GraphQL mutation that creates the service session. By isolating these steps, the job can skip one problematic appointment while continuing to recover the rest.
Known recoverable service-block errors are treated as data quality issues for that appointment, not as reasons to abandon the whole run. If an appointment still has valid billable blocks, the job writes them. If it has none, the job records why.
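The per-appointment failure boundary can be sketched as a loop that catches preparation and write failures separately. The function and counter names here are hypothetical; the point is that neither kind of failure escapes the single appointment it belongs to.

```typescript
// Hypothetical per-appointment loop: prep failures and write failures are
// counted separately, and neither one aborts the remaining appointments.
interface RunResult {
  created: number;
  skippedPrep: number;
  skippedWrite: number;
}

async function processAppointments(
  appointmentIds: string[],
  prepare: (id: string) => Promise<unknown>,
  write: (id: string, payload: unknown) => Promise<void>,
): Promise<RunResult> {
  const result: RunResult = { created: 0, skippedPrep: 0, skippedWrite: 0 };
  for (const id of appointmentIds) {
    let payload: unknown;
    try {
      payload = await prepare(id); // cost centers, physicians, case info, blocks
    } catch {
      result.skippedPrep++; // record why, then continue with the rest
      continue;
    }
    try {
      await write(id, payload); // the final mutation that creates the services
      result.created++;
    } catch {
      result.skippedWrite++;
    }
  }
  return result;
}
```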
Observability
The stabilized job ends with categorized counters instead of a vague success/failure signal. That made the recovery work measurable and turned the job into a diagnostic tool.
When services are not generated, the team can now see whether the cause is appointment data, master data, service-block validity, validation behavior, or the write path.
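The categorized counters can be sketched as a small run summary. The cause names below mirror the categories listed above but are otherwise hypothetical.

```typescript
// Hypothetical categorized run summary: each non-generated service is tallied
// under a cause, so the nightly log answers "why", not just "how many".
type Cause =
  | "appointmentData"
  | "masterData"
  | "blockValidity"
  | "validation"
  | "writePath";

class RunSummary {
  private counters = new Map<Cause, number>();

  record(cause: Cause): void {
    this.counters.set(cause, (this.counters.get(cause) ?? 0) + 1);
  }

  report(): Record<string, number> {
    return Object.fromEntries(this.counters);
  }
}
```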
Why it was good engineering
The result was a job that could be trusted for the actual business process, not just for technical execution.
- Bounded workload: date windows and query chunking make the job safer at normal and backfill scale.
- Less repeated work: caches reduce duplicate metadata and GraphQL calls.
- Smaller failure blast radius: one bad appointment no longer prevents unrelated appointments from being recovered.
- Better financial completeness: the system now finds and writes billable services that were previously missed.
- Operational evidence: categorized logs and summaries show where losses happen and what still needs cleanup.
Impact
The stabilized pipeline helped recover approximately 5,300 missing billable medical services. More importantly, it changed the nightly job from a fragile process into a repeatable recovery mechanism that can be monitored, explained, and safely rerun.
For a billing-adjacent healthcare workflow, that matters because clinical activity is represented more completely downstream, finance teams have fewer silent gaps to reconcile manually, and engineers get clearer signals when master data or validation rules need attention.