work record · Sep 2024 – Dec 2024
Ford Motor Company
Software Engineering Intern · Waterloo, ON
What I owned
The firmware team owns the connectivity layer of Ford's infotainment platform: modem firmware, the cellular stack, Wi-Fi/BT handoff, and the glue that exposes it all to the rest of the head unit. To validate every build, the team runs a fleet of bench rigs (a head unit wired to a modem, a SIM, and an RF chamber) replaying real drive cycles overnight.
Every rig pushes structured telemetry into Kafka[1]: modem state transitions, AT command traces, packet loss windows, signal quality, thermal counters, exception traces. Busy weeks hit ~8M events/day. Two things hurt. When a rig crashed overnight, the firmware engineer who owned the build burned 30 to 60 minutes scrolling Kibana before they could even guess at a cause. And the existing alerting paged on any modem drop over 5 seconds, which meant it paged constantly on known-flaky RF chambers. People stopped trusting the pager.
Slack bot for 'why did rig N crash?'
The brief was simple. Let me ask Slack what happened and get a real answer. The hard part was making the answer trustworthy enough that a senior firmware engineer would act on it without re-deriving the whole thing themselves.
The bot is a FastAPI service behind a Slack slash command. The interesting half is the offline pipeline that keeps a per-rig, per-session view of the Kafka stream queryable.
When the bot fires it does the boring, important thing first. Resolve the time window (default: last crash for that rig), pull the digest, pull the top-k log excerpts in that window, then call the LLM. The model never sees the raw 8M-event firehose. It sees a constrained packet.
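A minimal sketch of that packet assembly, under stated assumptions: the `Event` type, the `SEVERITY` ranking, the `build_packet` name, and the default `k` are hypothetical stand-ins, since the real scoring and digest format are not described here. It only shows the shape of the idea: filter to the window, keep the top k, and give each excerpt an ID the model can cite.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical severity ranking; the real pipeline's scoring is not described.
SEVERITY = {"exception": 3, "modem_state": 2, "packet_loss": 1}


@dataclass(frozen=True)
class Event:
    ts: datetime
    source: str  # e.g. "modem_state", "exception"
    line: str


def build_packet(rig_id, events, start, end, digest, k=20):
    """Return the bounded context handed to the model for one crash window."""
    in_window = [e for e in events if start <= e.ts <= end]
    # Rank by assumed severity, then most recent first, and keep only k.
    ranked = sorted(
        in_window,
        key=lambda e: (SEVERITY.get(e.source, 0), e.ts),
        reverse=True,
    )[:k]
    # Restore chronological order and assign IDs the model can cite.
    ranked.sort(key=lambda e: e.ts)
    excerpts = [
        {"id": i, "ts": e.ts.isoformat(), "source": e.source, "line": e.line}
        for i, e in enumerate(ranked)
    ]
    return {
        "rig_id": rig_id,
        "time_window": {"start": start.isoformat(), "end": end.isoformat()},
        "digest": digest,
        "log_excerpts": excerpts,
    }
```

Whatever the scoring looks like in practice, the point is that the packet size is bounded by `k`, not by how noisy the night was.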
The prompt is structured. Every hypothesis has to cite specific excerpt IDs from the retrieved set.[2] If the response doesn't parse against the schema, the bot retries once, then falls back to showing the raw excerpts instead of guessing.
ROOT_CAUSE_SCHEMA = {
    "type": "object",
    "required": ["rig_id", "time_window", "log_excerpts",
                 "root_cause_hypotheses"],
    "properties": {
        "rig_id": {"type": "integer"},
        "time_window": {
            "type": "object",
            "required": ["start", "end"],
            "properties": {
                "start": {"type": "string", "format": "date-time"},
                "end": {"type": "string", "format": "date-time"},
            },
        },
        "log_excerpts": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["ts", "source", "line"],
            },
        },
        "root_cause_hypotheses": {
            "type": "array", "minItems": 1, "maxItems": 3,
            "items": {
                "type": "object",
                "required": ["summary", "confidence",
                             "supporting_excerpt_ids"],
            },
        },
    },
}
Root-cause JSON schema. Every claim must cite excerpt IDs.
Slack renders the cited excerpts as expandable log lines, so the engineer sees the evidence and not just the conclusion. That one constraint is what moved the bot from "novelty" to "people actually use it." Mean investigation time on rig crashes dropped from about 45 minutes to about 4 over the last six weeks of the internship.
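The validate, retry once, then fall back behaviour can be sketched roughly as follows. This is an illustration, not the bot's actual code: `check_response`, `answer`, and the injected `call_model` callable are hypothetical names, the check is hand-rolled rather than a full JSON Schema validation, and it assumes each retrieved excerpt carries an integer `id`, which the schema above does not list explicitly.

```python
import json

REQUIRED_KEYS = ("summary", "confidence", "supporting_excerpt_ids")


def check_response(raw, packet):
    """Return the parsed response if every hypothesis cites retrieved excerpts, else None."""
    try:
        resp = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(resp, dict):
        return None
    hyps = resp.get("root_cause_hypotheses")
    if not isinstance(hyps, list) or not 1 <= len(hyps) <= 3:
        return None
    valid_ids = {e["id"] for e in packet["log_excerpts"]}
    for h in hyps:
        if not isinstance(h, dict) or not all(k in h for k in REQUIRED_KEYS):
            return None
        cited = h["supporting_excerpt_ids"]
        if not isinstance(cited, list) or not cited:
            return None
        # Reject citations that point outside the retrieved set.
        if not all(isinstance(c, int) and c in valid_ids for c in cited):
            return None
    return resp


def answer(packet, call_model, max_attempts=2):
    """Try the model up to max_attempts times, then fall back to raw excerpts."""
    for _ in range(max_attempts):
        resp = check_response(call_model(packet), packet)
        if resp is not None:
            return {"kind": "hypotheses", "body": resp}
    # No guessing: show the evidence and let the engineer read it.
    return {"kind": "raw_excerpts", "body": packet["log_excerpts"]}
```

With `max_attempts=2` the model gets its first try plus one retry, which matches the behaviour described above.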
LSTM modem-dropout detector
Pager fatigue was a different problem. The signal was in the data. Connectivity drops have real precursors (RSRP slope, retransmit clusters, thermal creep) but a static threshold can't tell a real dropout from a planned RF-chamber attenuation step.
I pulled 68,032 connectivity traces from the previous quarter of regression runs. A trace is a 90-second window of per-100ms modem telemetry leading up to a candidate event. Labels came from the rig owner's post-hoc triage notes: 11,204 positives (real dropouts), 56,828 negatives (benign, planned attenuation, known-flaky chamber). Split 70/15/15, grouped by rig-build pair so the model never trained on traces from a pair it was evaluated on. That mattered. An earlier random split inflated val precision by ~6 pp through rig-identity leakage.
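The leakage-aware split comes down to assigning whole rig-build pairs to a split, never individual traces. A minimal stdlib sketch, assuming each trace is a dict with hypothetical `rig` and `build` keys (the real trace format is not shown here):

```python
import random
from collections import defaultdict


def grouped_split(traces, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split traces so no (rig, build) pair appears in more than one split."""
    groups = defaultdict(list)
    for t in traces:
        groups[(t["rig"], t["build"])].append(t)

    keys = sorted(groups)               # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    targets = [f * len(traces) for f in fractions]
    splits = ([], [], [])               # train, val, test
    idx = 0
    for key in keys:
        # Move to the next split once the current one has reached its target size.
        while idx < len(splits) - 1 and len(splits[idx]) >= targets[idx]:
            idx += 1
        splits[idx].extend(groups[key])
    return splits
```

Split sizes come out approximate because groups are kept whole, and this sketch makes no attempt to balance the positive/negative ratio across splits.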
Started with a 1D CNN since that's the obvious move on fixed-length multivariate windows. It hit ~0.74 precision and plateaued. The failure mode was telling. It kept missing dropouts where the precursor was a slow drift across the full 90s, exactly where a CNN's local receptive field hurts you. A two-layer LSTM[3] with a small attention head over the sequence handled those long-horizon precursors and pushed precision past the CNN ceiling. Recall stayed roughly flat across architectures (~0.82). The gain was almost entirely in precision, which is the metric that maps to pager pain.
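For readers who want the shape of that architecture, here is a rough PyTorch sketch of a two-layer LSTM with attention pooling over the time steps. The class name, feature count, and hidden size are assumptions for illustration, not the deployed configuration. A 90-second window at one sample per 100 ms gives 900 steps.

```python
import torch
from torch import nn


class DropoutDetector(nn.Module):
    """Two-layer LSTM with a small attention head over the sequence (illustrative)."""

    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.score = nn.Linear(hidden, 1)   # one attention score per time step
        self.head = nn.Linear(hidden, 1)    # binary dropout logit

    def forward(self, x):
        # x: (batch, steps, n_features), e.g. steps = 900 for a 90 s window
        states, _ = self.lstm(x)                             # (batch, steps, hidden)
        weights = torch.softmax(self.score(states), dim=1)   # normalised over time
        context = (weights * states).sum(dim=1)              # (batch, hidden)
        return self.head(context).squeeze(-1), weights.squeeze(-1)
```

The attention weights let the classifier draw on any point in the window, which is what a slow drift across the full 90s needs. Returning them alongside the logit also gives a first, rough view of which time steps the model leaned on.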
Precision moved from 0.71 on the old threshold system to 0.88 on held-out test. False-alert volume on the on-call channel fell 41% week-over-week after rollout. The team tracked weekly pager-hours and averaged about 22 fewer hours/week of paged time. Two of the three rotation members stopped getting paged on weekends entirely.
What I learned
Three honest notes. One: most of the value in both projects came from data plumbing, not modelling. The indexing pipeline and the leakage-aware label split moved more numbers than any architecture choice. Two: forcing structured output with cited evidence is the single biggest thing that makes engineers trust an LLM tool. I'd build every future copilot this way. Three: the LSTM is good but it can't yet tell you which feature drove a prediction, and on-call has started asking. A small SHAP or attention-rollout pass over the deployed model is the obvious next step. I left a written handoff for it.
/ footnotes
- [1] Apache Kafka, consumer API. kafka.apache.org/documentation.
- [2] Anthropic, Messages API. Structured output and tool-use patterns used for schema-constrained responses. docs.anthropic.com/messages.
- [3] PyTorch nn.LSTM documentation. pytorch.org/docs/nn.LSTM.