work record · May 2025 – Aug 2025

Ford Motor Company

Software Engineering Intern · Waterloo, ON

stack

C++ · CAN Bus · Kafka · Graph Algorithms

role

Software Engineering Intern

01 context

Shipping in something with wheels

First time writing software that ships in a thing with wheels. Two things made it harder than any prior internship.

Real-time C++ has no escape hatch. A web service that takes 200 ms longer is a Datadog alert. A tire-pressure ISR that takes 200 ms longer can fail a federal compliance test, and missing that gate after freeze costs millions and a quarter of revenue. So you instrument first, hypothesize second, change code third.

Review is heavy on purpose. Anything touching a safety path needs two reviewers from the embedded platform team, a static-analysis pass (Coverity + MISRA-C++), and a hardware-in-the-loop run on the bench. The auto-triage project came out of pure frustration with that loop.

02 safety

TPMS interrupt path

Tire-Pressure Monitoring is federally mandated under FMVSS 138.^[1] Each wheel has a battery-powered sensor that radios pressure to the body control module. Pressure drops, dashboard lights up. The reg specifies a detection window, not per-event latency, but slow per-event latency eats into that window once you stack averaging filters on top.

fig 01 · Baseline vs optimised TPMS interrupt path · 1400 ms → 612 ms (−56%)

Three things were eating budget. One: the ISR ran at low priority and could get preempted by the OBD-II diagnostics handler. Two: inside the ISR the 64-byte sensor frame was memcpy'd into a heap-allocated buffer before being queued. Fine on x86, painful on the MCU because the allocator briefly takes an IRQ-disabling lock. Three: a 5 Hz polling loop (one pass every 200 ms) pulled queued frames into the CAN transmit task, adding up to 200 ms of jitter for no reason.

I bumped the TPMS IRQ to the same priority tier as airbag-deploy notifications (cleared with safety, documented in the FMEA update). Killed the ISR-side memcpy with a lock-free ring buffer of pre-allocated 64-byte slots. ISR writes the frame index, CAN task reads it. Replaced the polling loop with an event-driven wake on the producer index.
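A minimal sketch of that kind of handoff, not the shipped code. It assumes a single producer (the ISR) and a single consumer (the CAN task), a slot count that is a power of two, and atomics that are lock-free on the target. `FrameRing`, `claim`, `publish`, `peek`, `release` and `kSlots` are hypothetical names; only the 64-byte frame size comes from the text. The event-driven wake is left out.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch only. kSlots is an assumed depth and must be a power of two.
constexpr std::size_t kFrameBytes = 64;  // frame size stated in the text
constexpr std::size_t kSlots = 8;
static_assert((kSlots & (kSlots - 1)) == 0, "kSlots must be a power of two");
static_assert(std::atomic<std::size_t>::is_always_lock_free,
              "the ring relies on lock-free index updates");

struct Frame {
  std::uint8_t bytes[kFrameBytes];
};

// Single-producer (ISR), single-consumer (CAN task) ring of pre-allocated slots.
class FrameRing {
 public:
  // Producer: next free slot to fill in place, or nullptr when the ring is full.
  Frame* claim() {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == kSlots) return nullptr;
    return &slots_[head & (kSlots - 1)];
  }

  // Producer: publish the slot returned by the last successful claim().
  void publish() {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    // Debug check: publishing without a successful claim() would overrun.
    assert(head - tail_.load(std::memory_order_acquire) < kSlots);
    head_.store(head + 1, std::memory_order_release);
  }

  // Consumer: oldest unread frame, or nullptr when the ring is empty.
  const Frame* peek() const {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return nullptr;
    return &slots_[tail & (kSlots - 1)];
  }

  // Consumer: hand the slot returned by the last successful peek() back.
  void release() {
    tail_.store(tail_.load(std::memory_order_relaxed) + 1,
                std::memory_order_release);
  }

 private:
  std::array<Frame, kSlots> slots_{};
  std::atomic<std::size_t> head_{0};  // written only by the producer
  std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

The producer fills the slot in place and only then bumps the index with a release store, so the consumer never sees a half-written frame and nothing on the ISR side allocates or takes a lock.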

End-to-end latency landed at 612 ms on the bench, validated across temperature corners (−40 °C to +85 °C) and confirmed on the HIL rig. Passed FMVSS 138 with margin before freeze.

03 routing

Fuel-efficient routing

The 2027 F-150 ICE lineup and the Mach-E share a routing subsystem. Spec was a re-ranker that takes the top-K candidate routes from the existing nav engine and re-scores them on energy cost instead of time or distance.

Edge weights aren't constants. They depend on vehicle state.

cpp // excerpt

// edge cost = energy proxy in joules, lower is better
double cost(const Edge& e, const VehicleState& v, Time t) {
  return alpha  * distance(e)
       + beta   * grade(e)        * v.mass
       - gamma  * regen_credit(e, v)        // EV only
       + delta  * traffic_prior(e, t)
       + eps    * hvac_load(e, t);
}

Per-edge cost. γ is zero for ICE; δ is reweighted on F-150.

fig 02 · Top-K re-rank on energy: same candidates, re-ranked on energy cost · regen flips the pick

The Mach-E gets a regen term (downhill recovers ~30% of kinetic energy through the motor). The F-150 doesn't, so γ collapses to zero and δ gets reweighted. Elevation came from the existing tile cache, traffic priors from a 90-day rolling per-segment ETA table.
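A small sketch of how one cost shape can serve both vehicles through its weights. It assumes the per-edge terms have already been summed per candidate route, which is a simplification of the per-edge cost above. `RouteTerms`, `Weights`, `route_cost` and `rerank` are hypothetical names, and the numbers in the usage note are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch only. Field names and the per-route aggregation are assumptions;
// the real cost is evaluated per edge against live vehicle state.
struct RouteTerms {
  int id;
  double distance;       // summed distance(e)
  double grade_mass;     // summed grade(e) * v.mass
  double regen_credit;   // summed regen_credit(e, v)
  double traffic_prior;  // summed traffic_prior(e, t)
  double hvac_load;      // summed hvac_load(e, t)
};

struct Weights {
  double alpha, beta, gamma, delta, eps;
};

// Same shape as the per-edge cost: regen is subtracted, everything else adds.
double route_cost(const RouteTerms& r, const Weights& w) {
  return w.alpha * r.distance + w.beta * r.grade_mass -
         w.gamma * r.regen_credit + w.delta * r.traffic_prior +
         w.eps * r.hvac_load;
}

// Return candidate ids ordered cheapest first under the given weights.
// stable_sort keeps the nav engine's original order on ties.
std::vector<int> rerank(std::vector<RouteTerms> candidates, const Weights& w) {
  assert(w.gamma >= 0.0);  // regen is a credit, never a penalty
  std::stable_sort(candidates.begin(), candidates.end(),
                   [&w](const RouteTerms& a, const RouteTerms& b) {
                     return route_cost(a, w) < route_cost(b, w);
                   });
  std::vector<int> ids;
  ids.reserve(candidates.size());
  for (const RouteTerms& r : candidates) ids.push_back(r.id);
  return ids;
}
```

With illustrative inputs, a route of distance 12 carrying 5 units of regen credit ranks ahead of a flat route of distance 10 when gamma is 1, and behind it when gamma is 0. Same candidates, different pick.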

Prototyped with A* because the heuristic (great-circle distance × min-cost-per-km) is admissible and easy to reason about. For prod I switched to contraction hierarchies^[2] since the graph is mostly static between OTA map updates, so the preprocessing cost amortizes across queries. Query latency on the head unit's ARM Cortex-A53 stayed under 80 ms for 300 km routes.
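A toy version of the A* prototype idea, to show why that heuristic is safe. It swaps great-circle distance for planar Euclidean distance on made-up coordinates and assumes every arc costs at least `min_cost_per_km` times the straight-line distance between its endpoints, so the estimate never overshoots the true remaining cost. `Node`, `Arc` and `astar_cost` are hypothetical names. Contraction hierarchies are not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Sketch only. Planar coordinates stand in for lat/lon.
struct Node {
  double x_km, y_km;
};
struct Arc {
  int to;
  double cost;  // assumed >= min_cost_per_km * straight-line arc length
};

// Cheapest cost from src to dst, or +infinity if dst is unreachable.
double astar_cost(const std::vector<Node>& nodes,
                  const std::vector<std::vector<Arc>>& adj, int src, int dst,
                  double min_cost_per_km) {
  assert(min_cost_per_km >= 0.0);
  assert(src >= 0 && static_cast<std::size_t>(src) < nodes.size());
  assert(dst >= 0 && static_cast<std::size_t>(dst) < nodes.size());
  const double inf = std::numeric_limits<double>::infinity();

  // Lower bound on remaining cost: straight-line km times the cheapest rate.
  auto h = [&](int n) {
    return std::hypot(nodes[n].x_km - nodes[dst].x_km,
                      nodes[n].y_km - nodes[dst].y_km) *
           min_cost_per_km;
  };

  std::vector<double> g(nodes.size(), inf);  // best known cost from src
  using Item = std::pair<double, int>;       // (g + h, node)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> open;
  g[src] = 0.0;
  open.push({h(src), src});

  while (!open.empty()) {
    const auto [f, u] = open.top();
    open.pop();
    if (u == dst) return g[u];
    if (f > g[u] + h(u)) continue;  // stale entry, a cheaper one was pushed
    for (const Arc& a : adj[u]) {
      const double cand = g[u] + a.cost;
      if (cand < g[a.to]) {
        g[a.to] = cand;
        open.push({cand + h(a.to), a.to});
      }
    }
  }
  return inf;
}
```

The lower-bound assumption is the part to watch: any term that can push an arc's cost below `min_cost_per_km` times its length would break it.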

fig 03 · Average detour distance, baseline vs new ranker, across a 4K-trip sim set · lower is better

The 41% overall improvement came mostly from the Mach-E. Regen credit makes "longer but downhill" genuinely cheaper, which the old time-optimal ranker would never pick.

04 triage

Auto-triage on 230 test rigs

We had 230 hardware-in-the-loop rigs running nightly. Each rig emitted ~60 events/hr, or ~14K events/hr in aggregate. The status quo was a Slack firehose and 40 firmware engineers manually skimming for their failures every morning.

I wrote a classifier that consumes the Kafka stream^[3], tags each failure by signature, and routes it to the owner. Core is a priority-ordered regex and heuristic table. Fancy ML wasn't needed because the failure modes are narrow.

cpp // excerpt

enum class FailureKind { Crash, Leak, Threading, Other };

FailureKind classify(const TestEvent& e) {
  const auto& log = e.tail_log;  // last 4KB
  if (log.contains("SIGSEGV") || log.contains("assertion failed"))
    return FailureKind::Crash;
  if (e.heap_delta_kb > 256 && e.duration_s > 60)
    return FailureKind::Leak;
  if (log.contains("deadlock") || log.contains("TSAN: data race") ||
      e.thread_count_peak > e.thread_count_baseline * 3)
    return FailureKind::Threading;
  return FailureKind::Other;
}

Failure classifier. Ordering matters.
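One way to read "priority-ordered table" is as data: a list of rules tried top to bottom, first match wins. The sketch below is an assumption about shape, not the real classifier. `Rule`, `rule_table` and `classify_with_table` are hypothetical names, `TestEvent` is a stand-in with the fields the excerpt uses, and the patterns and thresholds are copied from the excerpt.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>
#include <vector>

enum class FailureKind { Crash, Leak, Threading, Other };

// Hypothetical event shape; fields mirror the ones used in the excerpt.
struct TestEvent {
  std::string tail_log;
  long heap_delta_kb = 0;
  long duration_s = 0;
  long thread_count_peak = 0;
  long thread_count_baseline = 0;
};

struct Rule {
  FailureKind kind;
  std::function<bool(const TestEvent&)> matches;
};

// Rules are tried top to bottom; the first match wins.
const std::vector<Rule>& rule_table() {
  static const std::regex crash_re("SIGSEGV|assertion failed");
  static const std::regex thread_re("deadlock|TSAN: data race");
  static const std::vector<Rule> table = {
      {FailureKind::Crash,
       [](const TestEvent& e) {
         return std::regex_search(e.tail_log, crash_re);
       }},
      {FailureKind::Leak,
       [](const TestEvent& e) {
         return e.heap_delta_kb > 256 && e.duration_s > 60;
       }},
      {FailureKind::Threading,
       [](const TestEvent& e) {
         return std::regex_search(e.tail_log, thread_re) ||
                e.thread_count_peak > e.thread_count_baseline * 3;
       }},
      {FailureKind::Other, [](const TestEvent&) { return true; }},  // catch-all
  };
  return table;
}

FailureKind classify_with_table(const TestEvent& e) {
  for (const Rule& rule : rule_table()) {
    if (rule.matches(e)) return rule.kind;
  }
  assert(false && "rule table must end with a catch-all");
  return FailureKind::Other;
}
```

A crash that also grew the heap still comes back as `Crash`, because the crash rule sits above the leak rule. That is the sense in which ordering matters, and a table makes reordering a one-line change.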

fig 04 · Kafka firehose to classifier to owner inbox · 14K events/hr · 230 rigs · 6h 20m to 3h 25m

Routing used git blame on the failing test file plus a CODEOWNERS lookup. Median PR review time dropped from 6h 20m to 3h 25m. Most of the win was just the right person seeing the failure in 5 minutes instead of the next morning.

05 reflection

What I'd do differently

I underspecified the failure modes for triage routing. When git blame landed on a refactor commit, the wrong person got paged. I added a 2nd-best fallback late in the term, but the right answer is probably blame-by-line-range weighted by recency.

TPMS passed HIL but I never got to see field telemetry post-freeze, and I'd push harder next time to ride along on the validation fleet.

On the routing ranker, I never benchmarked honest A* at production scale. I assumed CH would win and it did, but I should have the data to back it for the next reviewer who asks.
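For the triage-routing point, one possible reading of "blame-by-line-range weighted by recency" is sketched below. It was never built, so everything here is an assumption: `BlameHunk` is a hypothetical, already-parsed view of blame output, `pick_owner` is a made-up name, and the exponential half-life weighting is just one reasonable choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Sketch only. One hunk = who last touched lines [first_line, last_line]
// and how long ago.
struct BlameHunk {
  std::string author;
  int first_line;
  int last_line;
  double age_days;
};

// Score each author by the lines they own inside the failing range, with a
// line's weight halving every half_life_days. Returns "" if nothing overlaps.
std::string pick_owner(const std::vector<BlameHunk>& hunks, int range_first,
                       int range_last, double half_life_days) {
  assert(half_life_days > 0.0);
  std::map<std::string, double> score;
  for (const BlameHunk& h : hunks) {
    const int lo = std::max(h.first_line, range_first);
    const int hi = std::min(h.last_line, range_last);
    if (lo > hi) continue;  // hunk lies outside the failing range
    score[h.author] += (hi - lo + 1) * std::exp2(-h.age_days / half_life_days);
  }
  std::string best;
  double best_score = 0.0;
  for (const auto& [author, s] : score) {
    if (s > best_score) {
      best = author;
      best_score = s;
    }
  }
  return best;
}
```

Restricting the score to the failing line range is what keeps a commit that touched the rest of the file from winning by volume. Skipping known refactor commits before scoring would be a separate step, not shown.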

/ footnotes

[1] NHTSA, FMVSS No. 138, Tire Pressure Monitoring Systems. nhtsa.gov/laws-regulations/fmvss.
[2] Geisberger, Sanders, Schultes, Delling. Contraction Hierarchies: Faster and Simpler Hierarchical Routing in Road Networks. algo2.iti.kit.edu.
[3] Apache Kafka consumer documentation. kafka.apache.org.