Reliability engineering
Nightly medical-service job stabilization
A nightly appointment-processing job looked healthy in production but silently left recoverable billable services behind. I rebuilt the run around bounded batches, caching, fault isolation, and categorized operational reporting.
Key metrics
- Recovered services: about 5,300
- Failure boundary: per appointment
- Run shape: windowed and chunked
Problem
A nightly appointment-processing job was responsible for finding recent clinical appointments, checking which ones were still missing billable medical services, mapping those appointments to the right service blocks, and writing the generated services back into the hospital billing workflow.
The job looked straightforward on paper. In production, it was quietly leaving recoverable work behind. Some appointments were skipped because the run operated on too much data at once. Others failed because one invalid service block or missing configuration could interrupt otherwise valid appointments.
I investigated the missing services end to end, then refactored the nightly job around batching, caching, fault isolation, and operational visibility. The stabilized job helped recover approximately 5,300 missing billable medical services and made future backfills and nightly runs much safer to operate.
Domain names, identifiers, and internal implementation details are generalized here, while the architecture and engineering decisions are preserved.
Before
The core issue was the shape of the pipeline: it behaved like one long lane of work, where data volume, repeated lookups, and error handling all compounded each other.
After
The refactor changed the job from a fragile bulk process into a bounded, observable recovery pipeline. Each stage now has a smaller responsibility and a clearer failure boundary.
Windowed appointment processing
Instead of processing the full appointment range as one large unit, the job splits the date range into configurable windows. Each window loads appointments, checks for existing services, generates missing services, and then moves to the next window.
This keeps memory usage and query size bounded. It also makes backfills easier because historical ranges can be processed in predictable slices instead of turning every recovery run into one oversized production event.
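The windowing step can be sketched roughly like this. This is a minimal illustration, not the production code; the function name and shape are hypothetical.

```typescript
// Hypothetical sketch: split a date range into fixed-size windows so each
// batch of appointments stays bounded. Names are illustrative.
interface DateWindow {
  start: Date;
  end: Date;
}

function splitIntoWindows(start: Date, end: Date, windowDays: number): DateWindow[] {
  const windows: DateWindow[] = [];
  const msPerDay = 24 * 60 * 60 * 1000;
  let cursor = start;
  while (cursor < end) {
    // Cap the last window at the overall end date.
    const next = new Date(Math.min(cursor.getTime() + windowDays * msPerDay, end.getTime()));
    windows.push({ start: cursor, end: next });
    cursor = next;
  }
  return windows;
}
```

Each window is then processed to completion (load, check, generate, write) before the next one starts, so a backfill over months of history is just more windows, not a bigger query.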
Run-level caching
Several expensive values are reused across appointments during one run: service configuration, cost center data, responsible doctor settings, case type, and service block metadata.
The refactor introduced run-level caches around these lookups. Identical requests reuse the same promise, and failed requests are removed from the cache so later appointments can retry instead of inheriting a poisoned cache entry.
Fault isolation
The most important reliability change was separating preparation failures from write failures and handling both at the appointment level.
Preparation includes loading cost centers, responsible physicians, case information, and service blocks. Writing is the final GraphQL mutation that creates the service session. By isolating these steps, the job can skip one problematic appointment while continuing to recover the rest.
Known recoverable service-block errors are treated as data quality issues for that appointment, not as reasons to abandon the whole run. If an appointment still has valid billable blocks, the job writes them. If it has none, the job records why.
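The per-appointment failure boundary can be sketched as a loop that catches preparation and write failures separately. The function and counter names here are hypothetical; the point is that neither kind of failure escapes the single appointment it belongs to.

```typescript
// Hypothetical per-appointment loop: prep failures and write failures are
// counted separately, and neither one aborts the remaining appointments.
interface RunResult {
  created: number;
  skippedPrep: number;
  skippedWrite: number;
}

async function processAppointments(
  appointmentIds: string[],
  prepare: (id: string) => Promise<unknown>,
  write: (id: string, payload: unknown) => Promise<void>,
): Promise<RunResult> {
  const result: RunResult = { created: 0, skippedPrep: 0, skippedWrite: 0 };
  for (const id of appointmentIds) {
    let payload: unknown;
    try {
      payload = await prepare(id); // cost centers, physicians, case info, blocks
    } catch {
      result.skippedPrep++; // record why, then continue with the rest
      continue;
    }
    try {
      await write(id, payload); // the final mutation that creates the services
      result.created++;
    } catch {
      result.skippedWrite++;
    }
  }
  return result;
}
```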
Observability
The stabilized job ends with categorized counters instead of a vague success/failure signal. That made the recovery work measurable and turned the job into a diagnostic tool.
When services are not generated, the team can now see whether the cause is appointment data, master data, service-block validity, validation behavior, or the write path.
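The categorized counters can be sketched as a small run summary. The cause names below mirror the categories listed above but are otherwise hypothetical.

```typescript
// Hypothetical categorized run summary: each non-generated service is tallied
// under a cause, so the nightly log answers "why", not just "how many".
type Cause =
  | "appointmentData"
  | "masterData"
  | "blockValidity"
  | "validation"
  | "writePath";

class RunSummary {
  private counters = new Map<Cause, number>();

  record(cause: Cause): void {
    this.counters.set(cause, (this.counters.get(cause) ?? 0) + 1);
  }

  report(): Record<string, number> {
    return Object.fromEntries(this.counters);
  }
}
```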
Why it was good engineering
The result was a job that could be trusted for the actual business process, not just for technical execution.
- Bounded workload: date windows and query chunking make the job safer at normal and backfill scale.
- Less repeated work: caches reduce duplicate metadata and GraphQL calls.
- Smaller failure blast radius: one bad appointment no longer prevents unrelated appointments from being recovered.
- Better financial completeness: the system now finds and writes billable services that were previously missed.
- Operational evidence: categorized logs and summaries show where losses happen and what still needs cleanup.
Impact
The stabilized pipeline helped recover approximately 5,300 missing billable medical services. More importantly, it changed the nightly job from a fragile process into a repeatable recovery mechanism that can be monitored, explained, and safely rerun.
For a billing-adjacent healthcare workflow, that matters because clinical activity is represented more completely downstream, finance teams have fewer silent gaps to reconcile manually, and engineers get clearer signals when master data or validation rules need attention.