work record · May 2025 – Aug 2025
Ford Motor Company
Software Engineering Intern · Waterloo, ON
Shipping in something with wheels
First time writing software that ships in a thing with wheels. Two things made it harder than any prior internship.
Real-time C++ has no escape hatch. A web service that takes 200ms longer is a Datadog alert. A tire-pressure ISR that takes 200ms longer can fail a federal compliance test, and missing that gate after freeze costs millions and a quarter of revenue. So you instrument first, hypothesize second, change code third.
Review is heavy on purpose. Anything touching a safety path needs two reviewers from the embedded platform team, a static-analysis pass (Coverity + MISRA-C++), and a hardware-in-the-loop run on the bench. The auto-triage project came out of pure frustration with that loop.
TPMS interrupt path
Tire-Pressure Monitoring is federally mandated under FMVSS 138.[1] Each wheel has a battery-powered sensor that radios pressure to the body control module. Pressure drops, dashboard lights up. The reg specifies a detection window, not per-event latency, but slow per-event latency eats into that window once you stack averaging filters on top.
Three things were eating budget. One: the ISR ran at low priority and could get preempted by the OBD-II diagnostics handler. Two: inside the ISR the 64-byte sensor frame was memcpy'd into a heap-allocated buffer before being queued. Fine on x86, painful on the MCU because the allocator briefly takes an IRQ-disabling lock. Three: a 5 Hz polling loop pulled queued frames into the CAN transmit task, adding up to 200ms of jitter for no reason.
I bumped the TPMS IRQ to the same priority tier as airbag-deploy notifications (cleared with safety, documented in the FMEA update). Killed the ISR-side memcpy with a lock-free ring buffer of pre-allocated 64-byte slots. ISR writes the frame index, CAN task reads it. Replaced the polling loop with an event-driven wake on the producer index.
End-to-end landed at 612ms on the bench, validated across temperature corners (−40 °C to +85 °C) and confirmed on the HIL rig. Passed FMVSS 138 with margin before freeze.
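The ISR-to-CAN hand-off described above can be sketched as a single-producer, single-consumer ring. This is a minimal illustration, not the shipped code: `FrameRing`, `claim`/`publish`/`peek`/`release`, and the depth of 8 slots are all hypothetical names and values. It assumes the frame is written into the slot in place (for example by the radio driver), so the ISR only advances an index. The event-driven wake of the CAN task is RTOS-specific and left out.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kFrameBytes = 64;  // sensor frame size from the text
constexpr std::uint32_t kSlots = 8;      // assumed depth; must be a power of two
static_assert((kSlots & (kSlots - 1)) == 0, "kSlots must be a power of two");

using Frame = std::array<std::uint8_t, kFrameBytes>;

// Single-producer (ISR), single-consumer (CAN task) ring of pre-allocated slots.
// No heap allocation and no lock on either side.
class FrameRing {
public:
    // Producer: pointer to the next free slot to fill in place, or nullptr if full.
    std::uint8_t* claim() {
        const std::uint32_t head = head_.load(std::memory_order_relaxed);
        const std::uint32_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == kSlots) return nullptr;
        return slots_[head & (kSlots - 1)].data();
    }

    // Producer: make the slot returned by claim() visible to the consumer.
    void publish() {
        const std::uint32_t head = head_.load(std::memory_order_relaxed);
        assert(head - tail_.load(std::memory_order_acquire) < kSlots);  // debug-only check
        head_.store(head + 1, std::memory_order_release);
    }

    // Consumer: oldest unread frame, or nullptr if the ring is empty.
    const Frame* peek() const {
        const std::uint32_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint32_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return nullptr;
        return &slots_[tail & (kSlots - 1)];
    }

    // Consumer: hand the slot returned by peek() back to the producer.
    void release() {
        const std::uint32_t tail = tail_.load(std::memory_order_relaxed);
        assert(tail != head_.load(std::memory_order_acquire));  // debug-only check
        tail_.store(tail + 1, std::memory_order_release);
    }

private:
    std::array<Frame, kSlots> slots_{};
    std::atomic<std::uint32_t> head_{0};  // written only by the producer
    std::atomic<std::uint32_t> tail_{0};  // written only by the consumer
};
```

The release store on `head_` pairs with the acquire load in `peek()`, so the consumer never sees an index before the frame bytes behind it. A full ring makes `claim()` return nullptr, so the drop policy stays an explicit decision for the caller.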
Fuel-efficient routing
The 2027 F-150 ICE lineup and the Mach-E share a routing subsystem. Spec was a re-ranker that takes the top-K candidate routes from the existing nav engine and re-scores them on energy cost instead of time or distance.
Edge weights aren't constants. They depend on vehicle state.
// edge cost = energy proxy in joules, lower is better
double cost(const Edge& e, const VehicleState& v, Time t) {
    return alpha * distance(e)
         + beta  * grade(e) * v.mass
         - gamma * regen_credit(e, v)   // EV only
         + delta * traffic_prior(e, t)
         + eps   * hvac_load(e, t);
}
Per-edge cost. γ is zero for ICE; δ is reweighted on F-150.
The Mach-E gets a regen term (downhill recovers ~30% of kinetic energy through the motor). The F-150 doesn't, so γ collapses to zero and δ gets reweighted. Elevation came from the existing tile cache, traffic priors from a 90-day rolling per-segment ETA table.
Prototyped with A* because the heuristic (great-circle distance × min-cost-per-km) is admissible and easy to reason about. For prod I switched to contraction hierarchies[2] since the graph is mostly static between OTA map updates, so preprocessing amortizes. Query latency on the head unit's ARM Cortex-A53 stayed under 80ms for 300km routes.
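For reference, the A* prototype idea can be sketched as below. This is an illustration under assumptions, not the prototype itself: `Node`, `Arc`, and `astar` are hypothetical names, planar straight-line distance stands in for great-circle distance, and arc costs are taken as already evaluated for the current vehicle state. The heuristic is only admissible if `min_cost_per_km` is a true lower bound on cost per kilometre over every arc, which also means arc costs must stay non-negative after any regen credit.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Node { double x_km, y_km; };   // planar stand-in for lat/lon (assumption)
struct Arc  { int to; double cost; }; // cost pre-evaluated for the vehicle state

// Returns the minimum total cost from src to dst, or infinity if unreachable.
double astar(const std::vector<Node>& nodes,
             const std::vector<std::vector<Arc>>& adj,
             int src, int dst, double min_cost_per_km) {
    const int n = static_cast<int>(nodes.size());
    assert(src >= 0 && src < n && dst >= 0 && dst < n);
    const double inf = std::numeric_limits<double>::infinity();

    // Heuristic: straight-line distance to the goal times the cheapest
    // possible cost per km. It never overestimates if that bound holds.
    auto h = [&](int v) {
        return std::hypot(nodes[v].x_km - nodes[dst].x_km,
                          nodes[v].y_km - nodes[dst].y_km) * min_cost_per_km;
    };

    std::vector<double> g(nodes.size(), inf);  // best known cost from src
    using Item = std::pair<double, int>;       // (g + h, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> open;

    g[src] = 0.0;
    open.push({h(src), src});
    while (!open.empty()) {
        const Item top = open.top();
        open.pop();
        const double f = top.first;
        const int u = top.second;
        if (u == dst) return g[u];
        if (f > g[u] + h(u)) continue;  // stale entry, a cheaper path was found
        for (const Arc& a : adj[u]) {
            const double cand = g[u] + a.cost;
            if (cand < g[a.to]) {
                g[a.to] = cand;
                open.push({cand + h(a.to), a.to});
            }
        }
    }
    return inf;
}
```

On a four-node graph where the detour through a side node is cheaper than the direct chain, `astar` returns the detour's cost, the same "longer but cheaper" behavior the re-ranker is after.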
The 41% overall came mostly from the Mach-E. Regen credit makes "longer but downhill" genuinely cheaper, which the old time-optimal ranker would never pick.
Auto-triage on 230 test rigs
We had 230 hardware-in-the-loop rigs running nightly. Each rig emitted ~60 events/hr, ~14K/hr in aggregate. The status quo was a Slack firehose and 40 firmware engineers manually skimming for their failures every morning.
I wrote a classifier that consumes the Kafka stream[3], tags each failure by signature, and routes it to the owner. Core is a priority-ordered regex and heuristic table. Fancy ML wasn't needed because the failure modes are narrow.
enum class FailureKind { Crash, Leak, Threading, Other };

FailureKind classify(const TestEvent& e) {
    const auto& log = e.tail_log;  // last 4KB
    if (log.contains("SIGSEGV") || log.contains("assertion failed"))
        return FailureKind::Crash;
    if (e.heap_delta_kb > 256 && e.duration_s > 60)
        return FailureKind::Leak;
    if (log.contains("deadlock") || log.contains("TSAN: data race") ||
        e.thread_count_peak > e.thread_count_baseline * 3)
        return FailureKind::Threading;
    return FailureKind::Other;
}
Failure classifier. Ordering matters.
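The same checks can also be written as the priority-ordered table the prose describes, which keeps the ordering visible as data. This is a sketch under assumptions: `Rule`, `rule_table`, `classify_by_table`, and the `TestEvent` field types are hypothetical, and `std::regex` stands in for whatever matcher the real classifier used.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>
#include <vector>

enum class FailureKind { Crash, Leak, Threading, Other };

// Hypothetical event shape, limited to the fields the classifier reads.
struct TestEvent {
    std::string tail_log;  // last 4KB of the rig log
    long heap_delta_kb = 0;
    long duration_s = 0;
    long thread_count_peak = 0;
    long thread_count_baseline = 0;
};

struct Rule {
    FailureKind kind;
    std::function<bool(const TestEvent&)> matches;
};

// Rules in priority order: the first match wins.
const std::vector<Rule>& rule_table() {
    static const std::regex crash_re("SIGSEGV|assertion failed");
    static const std::regex threading_re("deadlock|TSAN: data race");
    static const std::vector<Rule> table = {
        {FailureKind::Crash,
         [](const TestEvent& e) { return std::regex_search(e.tail_log, crash_re); }},
        {FailureKind::Leak,
         [](const TestEvent& e) { return e.heap_delta_kb > 256 && e.duration_s > 60; }},
        {FailureKind::Threading,
         [](const TestEvent& e) {
             return std::regex_search(e.tail_log, threading_re) ||
                    e.thread_count_peak > e.thread_count_baseline * 3;
         }},
    };
    return table;
}

FailureKind classify_by_table(const TestEvent& e) {
    const std::vector<Rule>& table = rule_table();
    assert(!table.empty());
    for (const Rule& r : table) {
        if (r.matches(e)) return r.kind;
    }
    return FailureKind::Other;  // nothing matched
}
```

An event that both segfaults and grows the heap is tagged as a crash, because the crash rule sits first in the table. Reordering rows changes the outcome, which is the point of "ordering matters".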
Routing used git blame on the failing test file plus a CODEOWNERS lookup. Median PR review time dropped from 6h 20m to 3h 25m. Most of the win was just the right person seeing the failure within 5 minutes instead of the next morning.
What I'd do differently
I underspecified the failure modes for triage routing. When git blame landed on a refactor commit, the wrong person got paged. I added a 2nd-best fallback late in the term, but the right answer is probably blame-by-line-range weighted by recency. TPMS passed HIL but I never got to see field telemetry post-freeze, and I'd push harder next time to ride along on the validation fleet. On the routing ranker, I never benchmarked honest A* at production scale. I assumed CH would win and it did, but I should have the data to back it for the next reviewer who asks.
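One possible reading of "blame-by-line-range weighted by recency" is sketched below; it was not built during the term. It assumes the blame output for the failing line range (for example from `git blame -L`) has already been parsed into per-line author and age, and it uses an exponential half-life decay as the weighting. `BlameLine`, `rank_owners`, and the half-life parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One blamed line inside the failing range (assumed to be parsed elsewhere).
struct BlameLine {
    std::string author;
    double age_days;  // age of the commit that last touched this line
};

using OwnerScore = std::pair<std::string, double>;

// Scores each author by the number of lines they last touched, with each
// line's weight halving every half_life_days. Returns authors sorted by
// descending score, so element 0 is the primary owner and element 1 the
// fallback.
std::vector<OwnerScore> rank_owners(const std::vector<BlameLine>& lines,
                                    double half_life_days) {
    assert(half_life_days > 0.0);
    std::map<std::string, double> score;
    for (const BlameLine& l : lines) {
        score[l.author] += std::exp2(-l.age_days / half_life_days);
    }
    std::vector<OwnerScore> ranked(score.begin(), score.end());
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const OwnerScore& a, const OwnerScore& b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```

With a 30-day half-life, three lines touched 30 days ago outweigh one line touched today; with a 10-day half-life the ranking flips. The decay alone does not filter out refactor commits, so it would likely still need a way to skip known mechanical changes.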
/ footnotes
- [1] NHTSA, FMVSS No. 138, Tire Pressure Monitoring Systems. nhtsa.gov/laws-regulations/fmvss.
- [2] Geisberger, Sanders, Schultes, Delling. Contraction Hierarchies: Faster and Simpler Hierarchical Routing in Road Networks. algo2.iti.kit.edu.
- [3] Apache Kafka consumer documentation. kafka.apache.org.