work record · Jan 2026 – Present
Boomerang Inc.
Software Engineering Intern · New York, NY
What Boomerang is
Boomerang is alumni search for recruiters. Think LinkedIn Recruiter, except the graph is your company's alumni and their second-degree connections. My job over the internship was pretty open-ended: make recruiters faster at the whole loop. Finding people, reaching out, scoring matches, and helping our own team ship fast enough to keep up. I ended up working on all four.
NLP search over 2M alumni
The old search was a form with 300+ filters. To find people who just left FAANG you had to know to set "previous company = Meta" and "tenure 2+ years" and "departure year = 2023." Most recruiters never figured that out.
So I embedded each of the 2M alumni profiles into a single dense vector (title and company history, location, education, bio) and stored them in Postgres using pgvector[1] with an HNSW index. A small rewriting step turns a query like "founders who left FAANG in 2023" into hard filters (role: founder, prior company in FAANG, year: 2023) plus a leftover semantic vector. The filters run in SQL, the vector does the similarity search, and a reranker stitches the list together.
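To make the rewriting step concrete, here is a minimal sketch of how a query could be split into hard filters plus leftover text. It is not the production rewriter: the rule-based parsing, the FAANG list, the filter names, and the table and column names in the SQL (alumni, role, prior_company, departure_year, embedding) are all assumptions for illustration. The only pgvector details relied on are the `<=>` cosine-distance operator and the `hnsw` index type.

```python
import re

# Hypothetical company group; the real list and filter vocabulary are assumptions.
FAANG = ["Meta", "Apple", "Amazon", "Netflix", "Google"]

# Illustrative query shape. Table and column names are made up for this sketch.
# A matching index could be created with:
#   CREATE INDEX ON alumni USING hnsw (embedding vector_cosine_ops);
SEARCH_SQL = """
SELECT id
FROM alumni
WHERE role = %(role)s
  AND prior_company = ANY(%(companies)s)
  AND departure_year = %(year)s
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT 200
"""


def rewrite_query(query: str) -> dict:
    """Split a free-text query into hard filters plus leftover semantic text."""
    filters = {}
    rest = query

    # A four-digit year becomes a departure-year filter.
    year = re.search(r"\b(19|20)\d{2}\b", rest)
    if year:
        filters["departure_year"] = int(year.group(0))
        rest = rest.replace(year.group(0), " ")

    # The word "FAANG" expands to a list of prior companies.
    if re.search(r"\bFAANG\b", rest, flags=re.IGNORECASE):
        filters["prior_company_in"] = FAANG
        rest = re.sub(r"\bFAANG\b", " ", rest, flags=re.IGNORECASE)

    # "founder" or "founders" becomes a role filter.
    if re.search(r"\bfounders?\b", rest, flags=re.IGNORECASE):
        filters["role"] = "founder"
        rest = re.sub(r"\bfounders?\b", " ", rest, flags=re.IGNORECASE)

    # Whatever is left over is what would get embedded for similarity search.
    return {"filters": filters, "semantic_text": " ".join(rest.split())}
```

For "founders who left FAANG in 2023" this sketch produces the role, prior-company, and year filters and leaves "who left in" as the text to embed. A query with no recognizable filters passes through untouched and is handled by the vector search alone.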
The quality came from training on misses. When a recruiter ran a search, skipped the top hits, and clicked someone on page 3, that click became a positive and every result ranked above it became a negative. Two months of that loop closed most of the gap with the old hand-tuned filters, and outreach kept climbing.
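The labeling rule in that loop is simple enough to sketch. This assumes each search session is logged as an ordered list of result IDs plus the ID the recruiter clicked; the function name and log shape are hypothetical.

```python
def pairs_from_click(ranked_ids, clicked_id):
    """Turn one search session into (positive, negatives).

    The clicked result is the positive. Every result ranked above it was
    seen and skipped, so those become negatives. Results below the click
    are left unlabeled because the recruiter may never have looked at them.
    """
    if clicked_id not in ranked_ids:
        return None  # click on something outside the logged list; skip it
    position = ranked_ids.index(clicked_id)
    return clicked_id, list(ranked_ids[:position])
```

A click on the top result yields a positive with no negatives, which is still a useful signal that the ranking was right.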
Two-stage neural retriever
The "candidates for this role" feed ran on a gradient-boosted ranker over hand-built features like title match, tenure, and school tier. It worked fine, but acceptance, meaning the recruiter actually reached out, was stuck around 8%.
I replaced it with a two-stage neural retriever trained on 12M past candidate-job pairs (accepted = positive, dismissed in under 5 seconds = hard negative). Stage one is a bi-encoder[2] that maps candidates and jobs into the same space, fast enough to score the whole pool. Stage two is a cross-encoder that re-ranks the top 200 by reading the job and the profile together. Trained in PyTorch, served behind FastAPI with batched inference.
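    import numpy as np

    def score_candidates(job, pool, k=200):
        # pool is a NumPy array of candidates; bi_encoder and cross_encoder are the trained models
        job_vec = bi_encoder.encode_job(job)                # (d,)
        cand_vecs = bi_encoder.encode_candidates(pool)      # (N, d)
        coarse = cand_vecs @ job_vec                        # (N,) dot-product similarity
        k = min(k, len(pool))                               # argpartition needs kth < N
        top_k = pool[np.argpartition(-coarse, k - 1)[:k]]   # unordered top k by coarse score
        fine = cross_encoder.score_pairs(job, top_k)        # (k,)
        return top_k[np.argsort(-fine)]                     # best match first

Two-stage scoring, highlighted as the diagram runs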
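The training labels described earlier (accepted = positive, dismissed in under 5 seconds = hard negative) could be derived from interaction logs roughly like this. The `Interaction` record and its field names are hypothetical, and what happens to slower dismissals is an assumption: this sketch leaves them unlabeled.

```python
from dataclasses import dataclass
from typing import Optional

HARD_NEGATIVE_MAX_SECONDS = 5.0  # threshold taken from the write-up


@dataclass
class Interaction:
    candidate_id: str
    job_id: str
    action: str            # "accepted" or "dismissed" (hypothetical values)
    seconds_viewed: float  # time the recruiter spent on the card


def label(interaction: Interaction) -> Optional[int]:
    """Return 1 for a positive, 0 for a hard negative, None to skip the pair."""
    if interaction.action == "accepted":
        return 1
    if (interaction.action == "dismissed"
            and interaction.seconds_viewed < HARD_NEGATIVE_MAX_SECONDS):
        return 0
    # Assumption: slow dismissals are ambiguous, so they are not used here.
    return None
```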
Acceptance went from about 8% on the old ranker to 31% with the bi-encoder plus cross-encoder. That worked out to roughly 4x as many recruiter outreaches per search. The cross-encoder is the expensive part, so the 200-candidate cap earns its keep. Bumping it to 500 added less than a point of acceptance for 2.5x the latency, so I left it at 200.
The software factory
Before I shipped any product, I rebuilt how we shipped product. Every dev step (triage, branch setup, scaffolding, review, PR descriptions, tests) got a Claude[3] entrypoint in our internal CLI. Tickets carry real context: linked Notion docs, Granola call transcripts, past PRs on the same files. The factory hands that context to Claude with the right prompt for each step. Closed tickets per cycle went from about 14 to 31, and review acceptance didn't drop. Most of the win wasn't faster code generation. It was context plumbing. Claude is only as good as the smallest useful slice of repo and docs you can hand it.
The clearest place to watch the factory run is the in-app bug widget. A recruiter hits a bug, clicks the little widget in the corner, and types one sentence. Context attaches itself (URL, last 5 API calls, user role, feature flags) and the pipeline takes over: a triage classifier sorts it, an agent drafts the fix, tests and QA run, and it lands in the on-call channel for an engineer to confirm before it merges.
70% of bugs now close with no engineer work beyond that final confirmation, and on-call pages are down about 80%. The failure I worry about is a confident wrong patch on a bug that looks familiar but isn't. It happened twice in review, so I added a novelty score against the embedding index of past bugs to force human triage on anything new.
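One way to read "novelty score" is one minus the highest cosine similarity between the new bug's embedding and any past bug's embedding. The sketch below takes that reading. The formula, the function names, and the 0.35 threshold are assumptions for illustration, not the values used in production.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(bug_vec, past_vecs):
    """1 minus the best similarity to any past bug; 1.0 when there is no history."""
    if not past_vecs:
        return 1.0
    return 1.0 - max(cosine(bug_vec, past) for past in past_vecs)


def needs_human_triage(bug_vec, past_vecs, threshold=0.35):
    """Route to a human when nothing in the history looks close enough."""
    return novelty_score(bug_vec, past_vecs) > threshold
```

A brute-force scan like this only works for a small history. Against a real embedding index the nearest-neighbor lookup would replace the `max` over all past bugs.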
Cost and perf wins
HR sync, 52 min to 3 min. Our biggest customers push 200K-employee snapshots every night, and the old job re-fetched and re-enriched every single record. I added a Redis dirty-set keyed on (employee_id, source_etag) so we only touch rows that changed. The full sync dropped from 52 minutes to about 3, a 17x speedup, and we stopped tripping the source API's rate limits on Mondays.
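The change-detection idea can be sketched without Redis. Here a plain Python set stands in for the Redis set, and the row shape and function names are hypothetical. A row is treated as changed when its (employee_id, source_etag) pair has not been seen before.

```python
def changed_rows(snapshot, seen):
    """Yield only rows whose (employee_id, source_etag) pair is new."""
    for row in snapshot:
        key = (row["employee_id"], row["source_etag"])
        if key not in seen:
            yield row


def sync(snapshot, seen, enrich):
    """Enrich changed rows, record them as seen, and return how many were touched."""
    touched = 0
    for row in changed_rows(snapshot, seen):
        enrich(row)  # the expensive re-fetch and re-enrich step
        seen.add((row["employee_id"], row["source_etag"]))
        touched += 1
    return touched
```

Recording the pair only after `enrich` succeeds means a failed row is retried on the next run instead of being silently skipped.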
OpenAI bill down 60%. Two things. Field normalization ("Sr. SWE II" to "Senior Software Engineer," "MSFT" to "Microsoft") went from one LLM call per row to a few-shot prompt doing 50 rows at a time. And the nightly enrichment now sends only deltas against the last run, with cache hits served from Redis. Same accuracy on our eval set, way smaller invoice.
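The batching half of that change looks roughly like the sketch below. The prompt wording and helper names are hypothetical, and the model call itself is left out. The two few-shot pairs are the examples from the paragraph above.

```python
# Few-shot examples taken from the write-up; a real prompt would carry more.
FEW_SHOT = [
    ("Sr. SWE II", "Senior Software Engineer"),
    ("MSFT", "Microsoft"),
]


def batches(rows, size=50):
    """Split rows into consecutive chunks of at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_prompt(batch):
    """Build one few-shot prompt that normalizes a whole batch of raw values."""
    lines = [
        "Normalize each value. Answer with one normalized value per line, in order.",
        "",
        "Examples:",
    ]
    lines += [f"{raw} -> {clean}" for raw, clean in FEW_SHOT]
    lines += ["", "Values:"]
    lines += [f"{i + 1}. {raw}" for i, raw in enumerate(batch)]
    return "\n".join(lines)
```

With 50 rows per prompt, the instructions and examples are paid for once per batch instead of once per row, which is where most of the saving in a setup like this would come from. The response still needs a check that it contains exactly one line per input row.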
What I'd do differently
The bug widget shipped before I had a good way to measure bad auto-PRs in prod. I was tracking merged vs rejected, not merged-then-reverted-within-30-days, which is the number that actually matters. We added it, just later than I wanted. The search reranker is also still one model per locale, and I think per-customer adapters would beat it, but I ran out of time to prove it. And I leaned on Claude for code review more than I should have. It approves subtly wrong refactors in test files more often than you'd expect, so humans still need to read test diffs closely.
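The merged-then-reverted-within-30-days number is easy to compute once merge and revert timestamps are recorded per PR. A minimal sketch, assuming each PR is a dict with hypothetical `merged_at` and `reverted_at` fields:

```python
from datetime import datetime, timedelta


def revert_rate(prs, window_days=30):
    """Fraction of merged PRs that were reverted within `window_days` of merging."""
    window = timedelta(days=window_days)
    merged = [pr for pr in prs if pr.get("merged_at")]
    if not merged:
        return 0.0
    reverted = sum(
        1
        for pr in merged
        if pr.get("reverted_at")
        and pr["reverted_at"] - pr["merged_at"] <= window
    )
    return reverted / len(merged)
```

PRs merged less than 30 days ago have not had their full window yet, so a fair version of this metric would only count PRs older than the window.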
/ footnotes
- [1] pgvector, open-source vector similarity search for Postgres. github.com/pgvector/pgvector
- [2] Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering. A standard reference for the bi-encoder retrieval stage; the cross-encoder re-ranking stage follows the common retrieve-then-rerank pattern. arxiv.org/abs/2004.04906
- [3] Anthropic, Claude API documentation. docs.anthropic.com