
work record · Jan 2026 – Present

Boomerang Inc.

Software Engineering Intern · New York, NY

stack
Kotlin · PyTorch · FastAPI · pgvector · Redis · Claude
role
Software Engineering Intern
01 context

What Boomerang is

Boomerang is alumni search for recruiters. Think LinkedIn Recruiter, except the graph is your company's alumni and their second-degree connections. My job over the internship was pretty open-ended: make recruiters faster at the whole loop. Finding people, reaching out, scoring matches, and helping our own team ship fast enough to keep up. I ended up working on all four.

02 search

NLP search over 2M alumni

The old search was a form with 300+ filters. To find people who just left FAANG you had to know to set "previous company = Meta" and "tenure 2+ years" and "departure year = 2023." Most recruiters never figured that out.

So I embedded all 2M alumni profiles into one dense vector each (title and company history, location, education, bio) and stored them in Postgres with pgvector[1] on HNSW indexes. A small rewriting step turns a query like "founders who left FAANG in 2023" into hard filters (role: founder, prior company in FAANG, year: 2023) plus a leftover semantic vector. The filters run in SQL, the vector does the similarity search, and a reranker stitches the list together.
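
A minimal sketch of how the filter-plus-vector split could turn into one query, not the production code. It only builds a parameterized SQL string. The table name `alumni`, the column names, the `ALLOWED_COLUMNS` allow-list, and the `build_search_sql` helper are all hypothetical; `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import Any

# Hypothetical allow-list: filter columns are never taken from user text.
ALLOWED_COLUMNS = {"role", "prior_company", "departure_year"}


def build_search_sql(filters: dict[str, Any], query_vec: list[float], limit: int = 200):
    """Hard filters go in WHERE; the leftover vector orders rows by cosine distance."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown filter column: {column}")
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} = ANY(%s)")  # membership filter
            params.append(list(value))
        else:
            clauses.append(f"{column} = %s")       # equality filter
            params.append(value)
    where = " AND ".join(clauses) if clauses else "TRUE"
    # pgvector accepts a text literal such as '[0.1,0.2]' cast to vector.
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    sql = (
        f"SELECT id FROM alumni WHERE {where} "
        "ORDER BY embedding <=> %s::vector LIMIT %s"
    )
    params.extend([vec_literal, limit])
    return sql, params


sql, params = build_search_sql(
    {"role": "founder", "prior_company": ["Meta", "Google"], "departure_year": 2023},
    [0.1, 0.2],
)
```

The returned top 200 rows would then go to the reranker; the limit of 200 mirrors the "ANN top-200" step in fig 01.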

[fig 01: natural language → audience. A query-rewrite step splits the query into hard SQL filters (prior_company ∈ {Stripe, Square}, seniority ≥ Staff, location = New York, NY, departure_year = 2023) and a residual vector (dim 1536, ~ "fintech founder, building now"). pgvector HNSW takes the ANN top 200, then a rerank. Audience: 312 matches out of 2,041,883.]
fig 01: Plain-English query → filters + vector → a real audience

The quality came from training on misses. When a recruiter ran a search, skipped the top hits, and clicked someone on page 3, that click is a positive and everything above it is a negative. Two months of that loop closed most of the gap with the old hand-tuned filters, and outreach kept climbing.
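
The click rule above can be sketched as a small function that turns one search session into training pairs. This is an illustration under assumptions: the `pairs_from_click` name and the flat list of result ids are hypothetical, and the real logging format is not described in this write-up.

```python
def pairs_from_click(result_ids: list[str], clicked_id: str) -> list[tuple[str, str]]:
    """Return (positive, negative) pairs: the clicked profile vs. each one skipped above it."""
    if clicked_id not in result_ids:
        return []
    rank = result_ids.index(clicked_id)
    # Everything ranked above the click was seen and skipped, so it counts as a negative.
    return [(clicked_id, skipped) for skipped in result_ids[:rank]]


# A recruiter skips "a" and "b", then clicks "c".
example_pairs = pairs_from_click(["a", "b", "c", "d"], "c")
```

Results ranked below the click ("d" here) produce no pairs, since there is no evidence the recruiter looked at them.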

[fig 02: outreach per seat per month, Nov '25 to May '26, y-axis 0 to 1200. Series: outreach per seat, pre-launch trend.]
fig 02: Recruiter outreach per seat after the NLP search rollout

03 ranker

Two-stage neural retriever

The "candidates for this role" feed ran on a gradient-boosted ranker over hand-built features like title match, tenure, and school tier. It worked fine, but acceptance, meaning the recruiter actually reached out, was stuck around 8%.

I replaced it with a two-stage neural retriever trained on 12M past candidate-job pairs (accepted = positive, dismissed in under 5 seconds = hard negative). Stage one is a bi-encoder that maps candidates and jobs into the same space, fast enough to score the whole pool. Stage two is a cross-encoder that re-ranks the top 200 by reading the job and the profile together. Trained in PyTorch, served behind FastAPI with batched inference.[2]
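
The labeling rule for the training pairs can be written down directly. A minimal sketch, assuming each past interaction records whether the recruiter accepted or dismissed the candidate and how long the card was on screen; the `Interaction` fields and the `label` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    accepted: bool         # recruiter reached out
    dismissed: bool        # recruiter dismissed the candidate
    seconds_viewed: float  # time on the card before acting


def label(event: Interaction, fast_dismiss_s: float = 5.0) -> Optional[str]:
    """Accepted pairs are positives; dismissals in under 5 seconds are hard negatives."""
    if event.accepted:
        return "positive"
    if event.dismissed and event.seconds_viewed < fast_dismiss_s:
        return "hard_negative"
    return None  # left unlabeled in this sketch
```

How slower dismissals were treated is not stated above, so this sketch simply leaves them out.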

[fig 03: 12M pairs, top-200 cut. Two-stage retriever: a bi-encoder scores the full candidate pool and keeps the top 200; a cross-encoder re-ranks those into a final shortlist.]
fig 03: Bi-encoder scores the pool, cross-encoder re-ranks the top 200
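python
    import numpy as np

    # bi_encoder and cross_encoder are the trained models, loaded elsewhere.
    def score_candidates(job, pool, k=200):
        # pool: NumPy array of candidates, so it can be indexed by an index array
        job_vec = bi_encoder.encode_job(job)                # (d,)
        cand_vecs = bi_encoder.encode_candidates(pool)      # (N, d)
        coarse = cand_vecs @ job_vec                        # (N,) stage-1 scores
        k = min(k, len(pool))                               # argpartition needs kth < N
        if k == 0:
            return pool
        top_k = pool[np.argpartition(-coarse, k - 1)[:k]]   # top k, unordered
        fine = cross_encoder.score_pairs(job, top_k)        # (k,) stage-2 scores
        return top_k[np.argsort(-fine)]                     # best first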

Two-stage scoring, step for step with fig 03

Acceptance went from about 8% on the old ranker to 31% with the bi-encoder plus cross-encoder. That is roughly 4x as many candidates per search that a recruiter chose to contact. The cross-encoder is the expensive part, so the 200 cap earns its keep: bumping it to 500 added less than a point of acceptance for 2.5x the latency, so I left it at 200.

04 dev velocity · ops

The software factory

Before I shipped any product, I rebuilt how we shipped product. Every dev step (triage, branch setup, scaffolding, review, PR descriptions, tests) got a Claude[3] entrypoint in our internal CLI. Tickets carry real context: linked Notion docs, Granola call transcripts, past PRs on the same files. The factory hands that context to Claude with the right prompt for each step. Closed tickets per cycle went from about 14 to 31, and review acceptance didn't drop. Most of the win wasn't faster code generation. It was context plumbing. Claude is only as good as the smallest useful slice of repo and docs you can hand it.

The clearest place to watch the factory run is the in-app bug widget. A recruiter hits a bug, clicks the little widget in the corner, and types one sentence. Context attaches itself (URL, last 5 API calls, user role, feature flags) and the pipeline takes over: a triage classifier sorts it, an agent drafts and writes the fix, tests and QA run, and it lands in the on-call channel for an engineer to confirm before it merges.
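
The "context attaches itself" step could look something like the sketch below. It is written in Python for consistency with the other examples on this page, although the real widget runs in the app. The `ApiCallLog` and `BugReport` names and the string format of a logged call are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


class ApiCallLog:
    """Keeps only the most recent API calls; older ones fall off automatically."""

    def __init__(self, size: int = 5):
        self._calls = deque(maxlen=size)

    def record(self, method: str, path: str, status: int) -> None:
        self._calls.append(f"{method} {path} -> {status}")

    def snapshot(self) -> list:
        return list(self._calls)


@dataclass
class BugReport:
    message: str          # the one sentence the recruiter typed
    url: str
    user_role: str
    feature_flags: dict
    last_api_calls: list  # at most the last 5 calls


def attach_context(message, url, user_role, feature_flags, log: ApiCallLog) -> BugReport:
    return BugReport(message, url, user_role, dict(feature_flags), log.snapshot())
```

A bounded `deque` keeps the capture cheap: recording a call is constant time, and the report never carries more than five.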

[fig 04: report → triage → fix → qa → merge. A user reports a bug through an in-app widget; it is triaged, an agent writes the fix, QA runs, an engineer confirms, and it merges. Pipeline steps: 01 triage (classify, route), 02 agent fix (Claude writes patch), 03 qa (tests, checks), 04 review (engineer confirms), 05 merged (shipped).]
fig 04: One bug report, triaged and fixed mostly without an engineer

70% of bugs now close without an engineer touching anything, and on-call pages are down about 80%. The failure I worry about is a confident wrong patch on a bug that looks familiar but isn't. It happened twice in review, so I added a novelty score against the embedding index of past bugs to force human triage on anything new.
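
One way to read "novelty score" is one minus the highest cosine similarity between the new bug's embedding and any past bug's embedding. The sketch below takes that reading as an assumption; the threshold of 0.3 and the function names are hypothetical, and a real index would use an approximate nearest-neighbor lookup rather than a full scan.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty(new_vec, past_vecs) -> float:
    """1 - max cosine similarity to any past bug; 1.0 when there is no history."""
    if not past_vecs:
        return 1.0
    return 1.0 - max(cosine(new_vec, past) for past in past_vecs)


def needs_human_triage(new_vec, past_vecs, threshold: float = 0.3) -> bool:
    # Anything that does not closely resemble a past bug skips the auto-fix path.
    return novelty(new_vec, past_vecs) > threshold
```

The threshold trades on-call load against the risk of a confident wrong patch, so it would need tuning against bugs that were mis-fixed in review.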

05 infra

Cost and perf wins

HR sync, 52 min to 3 min. Our biggest customers push 200K-employee snapshots every night, and the old job re-fetched and re-enriched every single record. I added a Redis dirty-set keyed on (employee_id, source_etag) so we only touch rows that changed. The full sync dropped from 52 minutes to about 3, a 17x speedup, and we stopped tripping the source API's rate limits on Mondays.
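
A minimal sketch of the dirty-set idea, with a plain Python `set` standing in for the Redis set so the example runs on its own. The `sync` function, the `enrich` callback, and the row shape are hypothetical. With Redis, the membership check and the mark would map onto `SISMEMBER` and `SADD`.

```python
def sync(snapshot, seen: set, enrich) -> int:
    """Process only rows whose (employee_id, source_etag) has not been handled before."""
    touched = 0
    for row in snapshot:
        key = f'{row["employee_id"]}:{row["source_etag"]}'
        if key in seen:
            continue       # unchanged since a previous run: skip fetch and enrichment
        enrich(row)
        seen.add(key)      # mark only after enrichment succeeds, so failures retry
        touched += 1
    return touched


seen_keys = set()
processed = []
night_1 = [
    {"employee_id": 1, "source_etag": "a1"},
    {"employee_id": 2, "source_etag": "b1"},
    {"employee_id": 3, "source_etag": "c1"},
]
# Only employee 2 changed between the two snapshots.
night_2 = [
    {"employee_id": 1, "source_etag": "a1"},
    {"employee_id": 2, "source_etag": "b2"},
    {"employee_id": 3, "source_etag": "c1"},
]
first = sync(night_1, seen_keys, processed.append)
second = sync(night_2, seen_keys, processed.append)
```

Because the etag is part of the key, a changed record looks like a new key and gets reprocessed without any explicit diffing.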

[fig 05: 200K rows, nightly. A nightly sync funnels 200K rows through a dirty-set that drops 194K unchanged ones; the survivors are batched 50 at a time, and each row lookup probes a cache hierarchy. Process memory (small, projected to 3 of 12 columns) absorbs about 90% of reads. Redis (shared, same projection) absorbs about 9%. Postgres holds the full 12-column source of truth and serves under 1% of reads. The same job that took 52 minutes processing row by row finishes in 3 minutes, a 17× speedup driven by dirty-set, batching, projection, and layered caching.]
fig 05: Dirty-set sync processes only the rows that changed
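
The read path in fig 05 (process memory, then Redis, then Postgres) can be sketched with dictionaries standing in for each layer. Everything here is illustrative: the `LayeredReader` class, the three projected column names, and the dict-backed "Redis" and "Postgres" are assumptions, not the production code.

```python
# Hypothetical projection: the 3 of 12 columns the sync actually reads.
PROJECTED = ("employee_id", "title", "company")


class LayeredReader:
    def __init__(self, shared_cache: dict, source: dict):
        self.local = {}             # per-process memory
        self.shared = shared_cache  # stands in for Redis
        self.source = source        # stands in for Postgres (full rows)
        self.hits = {"memory": 0, "shared": 0, "source": 0}

    def get(self, employee_id):
        if employee_id in self.local:
            self.hits["memory"] += 1
            return self.local[employee_id]
        if employee_id in self.shared:
            self.hits["shared"] += 1
            row = self.shared[employee_id]
        else:
            self.hits["source"] += 1
            full = self.source[employee_id]
            row = {col: full[col] for col in PROJECTED}  # cache only the projection
            self.shared[employee_id] = row
        self.local[employee_id] = row
        return row
```

Caching only the projected columns keeps both cache layers small, which is what lets the in-process layer absorb most reads.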

OpenAI bill down 60%. Two things. Field normalization ("Sr. SWE II" to "Senior Software Engineer," "MSFT" to "Microsoft") went from one LLM call per row to a few-shot prompt doing 50 rows at a time. And the nightly enrichment now sends only deltas against the last run, with cache hits served from Redis. Same accuracy on our eval set, way smaller invoice.
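
The batching change can be sketched as two helpers: one that cuts rows into groups of 50 and one that builds a single few-shot prompt per group. The prompt wording, the `=>` format, and both function names are hypothetical, and the LLM call itself is left out.

```python
def chunks(rows: list, size: int = 50):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_prompt(batch: list, examples: list) -> str:
    """One prompt per batch: few-shot examples first, then the numbered raw values."""
    lines = ["Normalize each value. Answer one per line, in the same order."]
    for raw, clean in examples:
        lines.append(f"{raw} => {clean}")
    lines.append("---")
    lines.extend(f"{i + 1}. {raw}" for i, raw in enumerate(batch))
    return "\n".join(lines)


few_shot = [("Sr. SWE II", "Senior Software Engineer"), ("MSFT", "Microsoft")]
rows = [f"title {n}" for n in range(120)]
prompts = [build_prompt(batch, few_shot) for batch in chunks(rows)]
```

Here 120 rows become 3 calls instead of 120, and the few-shot examples are paid for once per batch rather than once per row.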

[fig 06: monthly OpenAI cost, baseline = 100. From a baseline of 100 (one LLM call per row), batching 50 rows per call removes 36, delta-only enrichment removes 16, and Redis cache hits remove 8, ending near 40 (a 60% cut).]
fig 06: Where the 60% OpenAI savings come from

06 open problems

What I'd do differently

The bug widget shipped before I had a good way to measure bad auto-PRs in prod. I was tracking merged vs rejected, not merged-then-reverted-within-30-days, which is the number that actually matters. We added it, just later than I wanted. The search reranker is also still one model per locale, and I think per-customer adapters would beat it, but I ran out of time to prove it. And I leaned on Claude for code review more than I should have. It approves subtly wrong refactors in test files more often than you'd expect, so humans still need to read test diffs closely.
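
The merged-then-reverted metric mentioned above is simple to compute once merge and revert timestamps are recorded. A minimal sketch, assuming each auto-PR is a dict with `merged_at` and `reverted_at` datetimes (or `None`); the field names and the `revert_rate` helper are hypothetical.

```python
from datetime import datetime, timedelta


def revert_rate(prs: list, window_days: int = 30) -> float:
    """Share of merged auto-PRs that were reverted within `window_days` of merging."""
    merged = [p for p in prs if p.get("merged_at") is not None]
    if not merged:
        return 0.0
    window = timedelta(days=window_days)
    reverted = sum(
        1
        for p in merged
        if p.get("reverted_at") is not None
        and p["reverted_at"] - p["merged_at"] <= window
    )
    return reverted / len(merged)


t0 = datetime(2026, 3, 1)
sample = [
    {"merged_at": t0, "reverted_at": t0 + timedelta(days=10)},  # counts
    {"merged_at": t0, "reverted_at": t0 + timedelta(days=45)},  # outside the window
    {"merged_at": t0, "reverted_at": None},                     # still in place
    {"merged_at": None, "reverted_at": None},                   # rejected, never merged
]
rate = revert_rate(sample)
```

Rejected PRs are excluded from the denominator, so this number stays separate from the merged-vs-rejected ratio that was tracked first.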

/ footnotes

  [1] pgvector, open-source vector similarity search for Postgres. github.com/pgvector/pgvector.
  [2] Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering. A standard reference for bi-encoder dense retrieval, the first stage of the pattern used here. arxiv.org/abs/2004.04906.
  [3] Anthropic, Claude API documentation. docs.anthropic.com.