
work record · May 2025 – Aug 2025

Ford Motor Company

Software Engineering Intern · Waterloo, ON

stack
C++ · CAN Bus · Kafka · Graph Algorithms
role
Software Engineering Intern
01 context

Shipping in something with wheels

First time writing software that ships in a thing with wheels. Two things made it harder than any prior internship.

Real-time C++ has no escape hatch. A web service that takes 200ms longer is a Datadog alert. A tire-pressure ISR that takes 200ms longer can fail a federal compliance test, and missing that gate after freeze costs millions and a quarter of revenue. So you instrument first, hypothesize second, change code third.

Review is heavy on purpose. Anything touching a safety path needs two reviewers from the embedded platform team, a static-analysis pass (Coverity + MISRA C++), and a hardware-in-the-loop run on the bench. The auto-triage project came out of pure frustration with that loop.

02 safety

TPMS interrupt path

Tire-Pressure Monitoring is federally mandated under FMVSS 138.[1] Each wheel has a battery-powered sensor that radios pressure to the body control module. Pressure drops, dashboard lights up. The reg specifies a detection window, not per-event latency, but slow per-event latency eats into that window once you stack averaging filters on top.

fig 01 · chart · 1400 ms → 612 ms · −56%
A TPMS sensor frame travels from the wheel through the RF receiver, ISR, transport, and cluster firmware before lighting the dashboard lamp. On the baseline path low-priority preemption, an ISR-side heap memcpy, and a 5 Hz polling loop pile on about 1400 ms. After bumping the IRQ priority, swapping the memcpy for a lock-free ring buffer, and replacing the polling loop with an event wake, the same path settles at about 612 ms with the regulator-driven 800 ms debounce untouched.
Stages: sensor (RF 315 MHz) → receiver → isr → transport → cluster → lamp
fig 01 · Baseline vs optimised TPMS interrupt path · 1400 ms to 612 ms

Three things were eating budget. One: the ISR ran at low priority and could get preempted by the OBD-II diagnostics handler. Two: inside the ISR the 64-byte sensor frame was memcpy'd into a heap-allocated buffer before being queued. Fine on x86, painful on the MCU because the allocator briefly takes an IRQ-disabling lock. Three: a 5 Hz polling loop pulled queued frames into the CAN transmit task, adding up to 200ms of jitter for no reason.

I bumped the TPMS IRQ to the same priority tier as airbag-deploy notifications (cleared with safety, documented in the FMEA update). Killed the ISR-side memcpy with a lock-free ring buffer of pre-allocated 64-byte slots. ISR writes the frame index, CAN task reads it. Replaced the polling loop with an event-driven wake on the producer index.
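
A minimal sketch of that handoff, assuming exactly one producer (the ISR) and one consumer (the CAN task). `FrameRing`, `kSlots`, and the claim/publish split are illustrative names, not the production firmware, and the sketch assumes `std::atomic<std::size_t>` is lock-free on the target MCU.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch only: names and capacity are assumptions.
constexpr std::size_t kFrameBytes = 64;  // frame size from the text
constexpr std::size_t kSlots = 8;        // assumed capacity
static_assert((kSlots & (kSlots - 1)) == 0,
              "power of two keeps index wrap-around consistent");

struct Frame { std::uint8_t bytes[kFrameBytes]; };

class FrameRing {
 public:
  // Producer (ISR): next free pre-allocated slot, or nullptr when full.
  Frame* claim() {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    return (head - tail == kSlots) ? nullptr : &slots_[head % kSlots];
  }

  // Producer (ISR): publish the claimed slot by advancing the producer index.
  void publish() {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    assert(head - tail_.load(std::memory_order_acquire) < kSlots);
    head_.store(head + 1, std::memory_order_release);
  }

  // Consumer (CAN task): copy out the oldest frame, or return false when empty.
  bool pop(Frame& out) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return false;
    out = slots_[tail % kSlots];
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

 private:
  std::array<Frame, kSlots> slots_{};  // allocated once, never on the heap
  std::atomic<std::size_t> head_{0};   // written only by the producer
  std::atomic<std::size_t> tail_{0};   // written only by the consumer
};
```

In this shape the ISR never allocates or blocks: a full ring returns `nullptr` and the caller decides whether to drop or count the frame. The event-driven wake would hang off `publish()` (for example an RTOS semaphore or task notification), which is left out here.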

End-to-end landed at 612ms on the bench, validated across temperature corners (−40 °C to +85 °C) and confirmed on the HIL rig. Passed FMVSS 138 with margin before freeze.

03 routing

Fuel-efficient routing

The 2027 F-150 ICE lineup and the Mach-E share a routing subsystem. Spec was a re-ranker that takes the top-K candidate routes from the existing nav engine and re-scores them on energy cost instead of time or distance.

Edge weights aren't constants. They depend on vehicle state.

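cpp
// excerpt
// edge cost = energy proxy in joules, lower is better
double cost(const Edge& e, const VehicleState& v, Time t) {
  return alpha  * distance(e)                  // base cost of covering the edge
       + beta   * grade(e)        * v.mass     // climbing work scales with mass
       - gamma  * regen_credit(e, v)           // EV only; gamma = 0 for ICE
       + delta  * traffic_prior(e, t)          // per-segment traffic prior at time t
       + eps    * hvac_load(e, t);             // HVAC load at time t
}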

Per-edge cost. γ is zero for ICE; δ is reweighted on F-150.
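
A rough sketch of how the re-rank could sit on top of that cost, assuming each candidate route is just a list of edges and the per-vehicle weights travel with the vehicle state. The struct fields, the `rerank` helper, and every number are hypothetical, and the time argument is dropped to keep it short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for the real Edge / VehicleState types.
struct Edge {
  double km;       // edge length
  double grade;    // signed rise over run; negative is downhill
  double regen;    // recoverable energy on this edge (EV only)
  double traffic;  // traffic prior
  double hvac;     // HVAC load
};
struct Weights { double alpha, beta, gamma, delta, eps; };
struct VehicleState { double mass_kg; Weights w; };
using Route = std::vector<Edge>;

// Same shape as the excerpt: gamma = 0 switches the regen credit off.
double cost(const Edge& e, const VehicleState& v) {
  assert(v.mass_kg > 0.0);
  const Weights& w = v.w;
  return w.alpha * e.km
       + w.beta  * e.grade * v.mass_kg
       - w.gamma * e.regen
       + w.delta * e.traffic
       + w.eps   * e.hvac;
}

double route_cost(const Route& r, const VehicleState& v) {
  double total = 0.0;
  for (const Edge& e : r) total += cost(e, v);
  return total;
}

// Returns candidate indices ordered cheapest-first by energy cost.
std::vector<std::size_t> rerank(const std::vector<Route>& candidates,
                                const VehicleState& v) {
  std::vector<double> costs;
  costs.reserve(candidates.size());
  for (const Route& r : candidates) costs.push_back(route_cost(r, v));

  std::vector<std::size_t> order(candidates.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  // stable_sort keeps the nav engine's original order on ties.
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return costs[a] < costs[b]; });
  return order;
}
```

With identical candidates, only the weights differ between vehicle lines, so flipping `gamma` from zero to non-zero is enough to move a longer downhill route ahead of a shorter flat one.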

fig 02 · chart · top-k re-rank on energy
Three candidate routes connect origin and destination. The F-150 picks the shortest. When the vehicle line flips to the Mach-E, the regen credit on the downhill route makes it the cheapest by energy and the pick slides from the time-optimal route to the energy-optimal one.
Candidates: route a · shortest; route b · uphill; route c · downhill · regen
fig 02 · Same candidates, re-ranked on energy cost · regen flips the pick

The Mach-E gets a regen term (downhill recovers ~30% of kinetic energy through the motor). The F-150 doesn't, so γ collapses to zero and δ gets reweighted. Elevation came from the existing tile cache, traffic priors from a 90-day rolling per-segment ETA table.

Prototyped with A* because the heuristic (great-circle distance × min-cost-per-km) is admissible and easy to reason about. For prod I switched to contraction hierarchies[2] since the graph is mostly static between OTA map updates, so preprocessing amortizes. Query latency on the head unit's ARM Cortex-A53 stayed under 80ms for 300km routes.
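
For reference, a compact sketch of the A* prototype idea with that heuristic. It assumes every edge cost is non-negative and at least `min_cost_per_km` times the great-circle distance it spans, which is what keeps the heuristic admissible. The graph layout and all names are illustrative, and this is not the contraction-hierarchies production path.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <queue>
#include <vector>

struct Node { double lat_deg, lon_deg; };
struct Arc  { std::size_t to; double cost; };  // cost assumed >= 0

// Haversine great-circle distance on a spherical Earth.
double great_circle_km(const Node& a, const Node& b) {
  constexpr double kEarthRadiusKm = 6371.0;
  constexpr double kDegToRad = 3.14159265358979323846 / 180.0;
  const double dlat = (b.lat_deg - a.lat_deg) * kDegToRad;
  const double dlon = (b.lon_deg - a.lon_deg) * kDegToRad;
  const double s = std::sin(dlat / 2) * std::sin(dlat / 2) +
                   std::cos(a.lat_deg * kDegToRad) * std::cos(b.lat_deg * kDegToRad) *
                   std::sin(dlon / 2) * std::sin(dlon / 2);
  return 2.0 * kEarthRadiusKm * std::asin(std::min(1.0, std::sqrt(s)));
}

struct QueueItem { double f; double g; std::size_t node; };
struct ByLowestF {
  bool operator()(const QueueItem& a, const QueueItem& b) const { return a.f > b.f; }
};

// Returns the cheapest src -> dst cost, or infinity if dst is unreachable.
double astar_cost(const std::vector<Node>& nodes,
                  const std::vector<std::vector<Arc>>& adj,
                  std::size_t src, std::size_t dst, double min_cost_per_km) {
  const double inf = std::numeric_limits<double>::infinity();
  // Heuristic: cheapest conceivable cost to cover the remaining distance.
  auto h = [&](std::size_t n) {
    return min_cost_per_km * great_circle_km(nodes[n], nodes[dst]);
  };
  std::vector<double> best(nodes.size(), inf);
  std::priority_queue<QueueItem, std::vector<QueueItem>, ByLowestF> open;
  best[src] = 0.0;
  open.push({h(src), 0.0, src});
  while (!open.empty()) {
    const QueueItem cur = open.top();
    open.pop();
    if (cur.g > best[cur.node]) continue;  // stale entry, a cheaper one was pushed later
    if (cur.node == dst) return cur.g;
    for (const Arc& arc : adj[cur.node]) {
      assert(arc.cost >= 0.0);
      const double g = cur.g + arc.cost;
      if (g < best[arc.to]) {
        best[arc.to] = g;
        open.push({g + h(arc.to), g, arc.to});
      }
    }
  }
  return inf;
}
```

Passing `min_cost_per_km = 0` turns the heuristic off and degrades this to Dijkstra, which is a cheap way to cross-check that the heuristic is not overestimating.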

fig 03 · chart · average detour (km) · lower is better

vehicle line    base    new
f-150           8.4     5.1
mach-e          6.9     3.8
overall         7.6     4.5

fig 03 · Average detour distance, baseline vs new ranker, across a 4K-trip sim set.

The 41% overall drop in average detour (7.6 km to 4.5 km) came mostly from the Mach-E. Regen credit makes "longer but downhill" genuinely cheaper, which the old time-optimal ranker would never pick.

04 triage

Auto-triage on 230 test rigs

We had 230 hardware-in-the-loop rigs running nightly. Each rig emitted ~60 events/hr, ~14K events/hr in aggregate. The status quo was a Slack firehose and 40 firmware engineers manually skimming for their failures every morning.

I wrote a classifier that consumes the Kafka stream[3], tags each failure by signature, and routes it to the owner. Core is a priority-ordered regex and heuristic table. Fancy ML wasn't needed because the failure modes are narrow.

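cpp
// excerpt
enum class FailureKind { Crash, Leak, Threading, Other };

// Checks run in priority order; the first match wins.
FailureKind classify(const TestEvent& e) {
  const auto& log = e.tail_log;  // last 4KB
  // Crash signatures outrank everything else.
  if (log.contains("SIGSEGV") || log.contains("assertion failed"))
    return FailureKind::Crash;
  // Leak heuristic: heap grew by more than 256 KB over a run longer than 60 s.
  if (e.heap_delta_kb > 256 && e.duration_s > 60)
    return FailureKind::Leak;
  // Threading: explicit signatures, or peak thread count above 3x baseline.
  if (log.contains("deadlock") || log.contains("TSAN: data race") ||
      e.thread_count_peak > e.thread_count_baseline * 3)
    return FailureKind::Threading;
  return FailureKind::Other;
}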

Failure classifier. Ordering matters.
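
The same logic can be written as an explicit priority-ordered table, which makes the ordering visible and easy to extend. This is a sketch under assumptions: the `TestEvent` field types are guessed from the excerpt, `Rule`, `kRules`, and `has` are hypothetical names, and plain substring matching stands in for the regexes.

```cpp
#include <array>
#include <string>

enum class FailureKind { Crash, Leak, Threading, Other };

// Field names mirror the excerpt; the types are assumptions.
struct TestEvent {
  std::string tail_log;  // last 4KB of the rig log
  long heap_delta_kb = 0;
  long duration_s = 0;
  int thread_count_peak = 0;
  int thread_count_baseline = 0;
};

// Substring check that also compiles before C++23's std::string::contains.
inline bool has(const std::string& log, const char* needle) {
  return log.find(needle) != std::string::npos;
}

struct Rule {
  FailureKind kind;
  bool (*matches)(const TestEvent&);
};

// Tried top to bottom; the first matching rule decides the tag.
const std::array<Rule, 3> kRules = {{
    {FailureKind::Crash,
     [](const TestEvent& e) {
       return has(e.tail_log, "SIGSEGV") || has(e.tail_log, "assertion failed");
     }},
    {FailureKind::Leak,
     [](const TestEvent& e) { return e.heap_delta_kb > 256 && e.duration_s > 60; }},
    {FailureKind::Threading,
     [](const TestEvent& e) {
       return has(e.tail_log, "deadlock") || has(e.tail_log, "TSAN: data race") ||
              e.thread_count_peak > e.thread_count_baseline * 3;
     }},
}};

FailureKind classify(const TestEvent& e) {
  for (const Rule& rule : kRules) {
    if (rule.matches(e)) return rule.kind;
  }
  return FailureKind::Other;
}
```

An event whose log shows both a segfault and a deadlock is tagged as a crash, because the crash rule sits first in the table.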

fig 04 · chart · 14k events/hr · 230 rigs
A fleet of 230 hardware-in-the-loop rigs publishes roughly 14,000 test events per hour through Kafka. A classifier tags each failure as crash, leak, threading, or other, then routes it to the responsible team's inbox. Median PR review time drops from 6h 20m to 3h 25m.
Pipeline: rig fleet → kafka → classifier → owner inboxes (crash, leak, threading, other)
fig 04 · Kafka firehose to classifier to owner inbox · 6h 20m to 3h 25m (−46%)

Routing used git blame on the failing test file plus a CODEOWNERS lookup. Median PR review time dropped from 6h 20m to 3h 25m. Most of the win was just the right person seeing the failure in 5 minutes instead of the next morning.

05 reflection

What I'd do differently

I underspecified the failure modes for triage routing. When git blame landed on a refactor commit, the wrong person got paged. I added a 2nd-best fallback late in the term, but the right answer is probably blame-by-line-range weighted by recency.

TPMS passed HIL but I never got to see field telemetry post-freeze, and I'd push harder next time to ride along on the validation fleet.

On the routing ranker, I never benchmarked honest A* at production scale. I assumed CH would win and it did, but I should have the data to back it for the next reviewer who asks.

/ footnotes

  [1] NHTSA, FMVSS No. 138, Tire Pressure Monitoring Systems. nhtsa.gov/laws-regulations/fmvss.
  [2] Geisberger, Sanders, Schultes, Delling. Contraction Hierarchies: Faster and Simpler Hierarchical Routing in Road Networks. algo2.iti.kit.edu.
  [3] Apache Kafka consumer documentation. kafka.apache.org.