
blog record — Apr 2026

Trade Reasoning Agent

An LLM that explains the why behind politician and insider trades, with a second pass that tries to break its own answer.

stack
Claude · Exa · EDGAR · Next.js · Firebase
01 context

What it does

Quiver, Capitol Trades, and Unusual Whales already list politician trades. None of them answer the question a portfolio manager actually cares about: why, and is it any good?

The naive version is easy to get wrong in three ways. Ask Claude open-endedly "why did Senator X buy NVDA on March 3?" and you get a clean paragraph about AI tailwinds that is mostly invented. Add retrieval, and the top results for politician-plus-ticker queries are SEO farms that get laundered into "analyst sentiment." Worst, if the news query pulls from after the filing date, the model writes a thesis that "predicts" what already happened. Most casual LLM-trading demos accidentally do that. The pipeline is mostly defenses against those three failures.

02 architecture

System overview

The hard rule is drawn as the dashed line in Fig 1. Anything above it uses only data with available_at <= t_filing. Anything below it can peek at future prices, because that's the backtester's job.

fig 01 · filing-time invariant
[pipeline diagram, top to bottom; the dashed line marks filing time]
feed: EDGAR, Form 4 / PTR (poll, ws fan-out)
01 / ingest: Filing Ingest. normalize ticker, side, size · dedupe vs Firestore
02 / retrieval: Context Builder. Exa neural search (published_before t_f) · domain trust + 8-K · earnings · committee schedule
03 / claude · json-mode: Thesis Generator → { driver, evidence[], confidence, ... }
04 / claude · adversary: Self-Check Critic. score evidence quality, circularity · reject / accept / request more context
05 / paper trader: Backtester · Live Paper Trader. filing-date-aware features · Firestore positions
fig 01 End-to-end pipeline. Everything above the dashed line runs on filing-time information only.
03 data

Three feeds

EDGAR Form 4 and PTRs. Form 4 covers corporate insiders and lands within two business days[1]. Congressional PTRs are messier. The STOCK Act[4] sets a 30-day deadline with a hard ceiling of 45 days. Most arrive at the ceiling, sometimes later, and some come as scanned PDFs. A parser normalizes both into a shared Filing doc in Firestore with two timestamps: t_filing (when it became public) and t_trade (when it executed). These are the single most important pair of values in the project.
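
A minimal sketch of what that shared Filing doc could look like. Only t_filing and t_trade come from the description above; the other field names, the source labels, and the trade date in the example are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Filing:
    """Hypothetical shape of the shared Filing doc."""
    filing_id: str
    source: str          # "FORM4" or "PTR" (assumed labels)
    ticker: str
    side: str            # "BUY" or "SELL"
    t_trade: datetime    # when the trade executed
    t_filing: datetime   # when the disclosure became public

    def __post_init__(self):
        # A disclosure cannot precede the trade it reports.
        if self.t_filing < self.t_trade:
            raise ValueError("t_filing must not precede t_trade")

    @property
    def disclosure_lag_days(self) -> int:
        """Whole days between execution and public disclosure."""
        return (self.t_filing - self.t_trade).days


# Illustrative values only: the trade date is made up.
ptr = Filing(
    filing_id="PTR-2026-03-03-XYZ",
    source="PTR",
    ticker="NVDA",
    side="BUY",
    t_trade=datetime(2026, 1, 20, tzinfo=timezone.utc),
    t_filing=datetime(2026, 3, 3, tzinfo=timezone.utc),
)
print(ptr.disclosure_lag_days)  # 42
```

Keeping both timestamps on one immutable record means every downstream step can ask "was this public yet?" without re-deriving it.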

Exa for news. Bing News is cheaper and Google Programmable Search has more coverage. I picked Exa because the neural search lets me query semantically ("evidence NVDA datacenter demand was strengthening before 2026-03-03") and its published_before filter actually works[2]. On top of that I keep a per-domain trust score: Reuters, Bloomberg, WSJ, FT, and company 8-Ks score high. Seeking Alpha contributor posts sit in the middle. Content farms score zero and get filtered out.
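
The per-domain trust filter can be sketched roughly like this. The tiers come from the paragraph above, but the numeric scores, the sec.gov entry standing in for company 8-Ks, and the choice to score unknown domains at zero are all assumptions.

```python
from urllib.parse import urlparse

# Illustrative scores only; the post names the tiers but not the numbers.
DOMAIN_TRUST = {
    "reuters.com": 0.9,
    "bloomberg.com": 0.9,
    "wsj.com": 0.9,
    "ft.com": 0.9,
    "sec.gov": 0.9,           # stand-in for company 8-Ks
    "seekingalpha.com": 0.5,  # contributor posts sit in the middle
}
DEFAULT_TRUST = 0.0  # assumption: unknown domains are treated like content farms


def domain_of(url: str) -> str:
    """Hostname with a leading 'www.' stripped."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def trust(url: str) -> float:
    host = domain_of(url)
    # Match the listed domain or any subdomain of it.
    for domain, score in DOMAIN_TRUST.items():
        if host == domain or host.endswith("." + domain):
            return score
    return DEFAULT_TRUST


def filter_results(results: list[dict]) -> list[dict]:
    """Drop zero-trust results and rank the rest by trust (stable sort)."""
    kept = [r for r in results if trust(r["url"]) > 0.0]
    return sorted(kept, key=lambda r: trust(r["url"]), reverse=True)
```

A default of zero is the conservative choice: a domain has to earn its way onto the list before it can reach the prompt.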

SEC 8-Ks and earnings transcripts. Free, structured, and usually the actual cause of any interesting institutional trade. Ranked above news when both are available.

04 prompt

Thesis generation

The thesis prompt is the most-iterated artifact in the codebase. Early versions asked Claude[3] open-endedly to "explain why this trade likely happened." Pretty prose, no structure. The current prompt does four things.

  1. Forces JSON against a strict schema so downstream code can score theses.
  2. Provides evidence first and the question last, so Claude sees the Exa context block before the filing. Small change, big drop in ticker-anchored confabulation.
  3. Requires inline citation IDs for every claim in evidence[]. No citation, auto-reject before the critic sees it.
  4. Asks for a falsifiable counter-signal: "what would have to be true for this thesis to be wrong?"
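
Points 2 and 3 can be sketched as follows. This is not the actual prompt, which isn't reproduced here: the function names, the excerpt fields (id, published_at, source, text), and the prompt wording are all hypothetical.

```python
import json


def build_thesis_prompt(filing: dict, excerpts: list[dict]) -> str:
    """Evidence block first, filing and question last."""
    evidence = "\n".join(
        f"[{e['id']}] ({e['published_at']}, {e['source']}) {e['text']}"
        for e in excerpts
    )
    return (
        "EVIDENCE (cite by ID):\n"
        f"{evidence}\n\n"
        "FILING:\n"
        f"{json.dumps(filing, sort_keys=True)}\n\n"
        "QUESTION: Using only the evidence above, return a JSON thesis "
        "with an evidence[] list where every claim carries a cite ID, "
        "plus a counter_signal stating what would make the thesis wrong."
    )


def uncited_claims(thesis: dict, excerpts: list[dict]) -> list[str]:
    """Return claims whose cite is missing or not in the context block."""
    known = {e["id"] for e in excerpts}
    return [
        item.get("claim", "")
        for item in thesis.get("evidence", [])
        if item.get("cite") not in known
    ]
```

A non-empty result from uncited_claims would be the auto-reject condition, checked before any critic call is spent on the thesis.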

json // excerpt
{
  "filing_id": "PTR-2026-03-03-XYZ",
  "ticker": "NVDA",
  "side": "BUY",
  "thesis": {
    "primary_driver": "DATACENTER_DEMAND_INFLECTION",
    "summary": "Filer increased NVDA exposure ahead of expected Q1 datacenter revenue beat...",
    "evidence": [
      { "claim": "Hyperscaler capex guidance raised", "cite": "ex_004", "weight": 0.4 },
      { "claim": "Supply constraint easing per 8-K", "cite": "ex_011", "weight": 0.3 }
    ],
    "confidence": 0.62,
    "horizon_days": 45,
    "counter_signal": "If hyperscaler capex commentary on next earnings reverses, thesis is invalidated."
  },
  "context_window": { "earliest": "2026-01-15", "latest_inclusive": "2026-03-03" }
}

The thesis schema Claude must emit.

latest_inclusive is the contract with the backtester. Any evidence with a published_at later than that timestamp and the whole thesis gets dropped.
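
A sketch of how that contract might be checked, assuming a lookup from citation ID to publication date is available. The function name, the lookup, and the policy of treating an unknown date as a violation are assumptions.

```python
from datetime import date


def violates_context_window(thesis_doc: dict,
                            published_at: dict[str, date]) -> list[str]:
    """Return cite IDs that break the latest_inclusive contract."""
    latest = date.fromisoformat(thesis_doc["context_window"]["latest_inclusive"])
    bad = []
    for item in thesis_doc["thesis"]["evidence"]:
        cite = item["cite"]
        # Assumed policy: an unknown publication date counts as a violation.
        if cite not in published_at or published_at[cite] > latest:
            bad.append(cite)
    return bad


def keep_thesis(thesis_doc: dict, published_at: dict[str, date]) -> bool:
    """One late citation is enough to drop the whole thesis."""
    return not violates_context_window(thesis_doc, published_at)
```

Dropping the whole thesis rather than just the offending evidence item keeps the remaining claims from silently leaning on context the model should never have seen.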

05 critic

The self-check loop

First version was a single critic prompt: "Here is a thesis. Score it 1 to 10." It scored everything a 7.

The current critic is structured as an adversary. Its job is to break the thesis, not evaluate it. Does every evidence item have a citation that actually loads? Is any evidence from a domain with trust under 0.5? Is the thesis circular, meaning does it cite the filing itself or news that only exists because of the filing? Is the counter_signal observable or a tautology? Is confidence calibrated against the weight-sum of evidence?

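python // excerpt
# Excerpt: `claude`, CRITIC_SYSTEM_PROMPT, render_critic_prompt, domain_trust,
# and the Reject / Revise / Accept verdict types are defined elsewhere.
# domain_trust is assumed to resolve a citation ID to its source domain's score.
def self_check(thesis, context):
    """Run the adversary critic and map its critique to a verdict."""
    critique = claude.complete(
        system=CRITIC_SYSTEM_PROMPT,
        user=render_critic_prompt(thesis, context),
        response_format="json",
    )

    # Circular theses cite the filing itself or coverage it triggered.
    if critique["circularity_flag"]:
        return Reject("circular: thesis derived from filing-induced coverage")

    # Reject when more than about a third of the evidence is low-trust.
    evidence = thesis["evidence"]
    weak = [e for e in evidence if domain_trust(e["cite"]) < 0.5]
    if len(weak) / max(len(evidence), 1) > 0.34:
        return Reject("over a third of evidence from low-trust domains")

    # The counter-signal must be something that can actually be observed.
    if not critique["counter_signal_is_observable"]:
        return Reject("counter-signal not falsifiable")

    # A large gap between stated and recomputed confidence goes back for revision.
    if abs(thesis["confidence"] - critique["recomputed_confidence"]) > 0.25:
        return Revise(suggested_confidence=critique["recomputed_confidence"])

    return Accept(score=critique["adversary_score"])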

The critic gate. Reject early, fail loud.

fig 02 · 38% rejected · hit rate 51% → 58%
[chart: a 100-thesis batch streamed through the adversary critic (claude · higher T). The critic rejects 38, sends 15 back for revision, and accepts 47. The downstream hit rate lifts from 51% to 58% (+7pp).]
fig 02 Adversarial critic on a 100-thesis batch. Smaller accepted set, materially better hit rate.

About 38% get rejected outright, another ~15% are sent back for one revision, and the remaining ~47% reach the paper trader. Pre-critic hit rate was 51%. Post-critic, 58%.

06 backtest

Filing-date-aware backtesting

Politicians disclose late. Form 4 insiders disclose less late but still not in real time. Use any information dated after t_filing, or credit the strategy with the price move between t_trade and t_filing, and the backtest is contaminated and the paper returns are fiction.

The guard, applied to every feature x_i used to construct or score a thesis:

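$$\tau(x_i) \le t_{\text{filing}} \quad \text{and} \quad t_{\text{trade}} \le t_{\text{filing}} \qquad \text{(leakage guard)}$$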
fig 03 · 7 candidates · 4 admitted · ≤ t_filing
[chart: evidence pinning, published_before = t_filing. Exa returns seven candidate documents, each placed on a time axis (t−30d to t+5d) at its published_at date relative to t_filing. The vertical barrier at t_filing drops any document published after the filing, so Claude only sees the four admitted excerpts. Admitted: reuters, bloomberg, company 8-k, wsj. Dropped: seo-farm, reddit, blog post.]
fig 03 Evidence-pinning gate. Documents published after t_filing are dropped before Claude sees them.

τ(x_i) is when feature x_i first became public. The second clause is the definition of disclosure lag, true by construction. The first is enforced in three places.

  1. Evidence pinning: every Exa query carries published_before = t_filing, so later docs get dropped before Claude sees them.
  2. Price-feature lag: rolling indicators (e.g. 20-day momentum) compute on [t_filing - 20d, t_filing], never on [t_trade - 20d, t_trade], since anchoring on t_trade hands the strategy roughly 30 days of free lookahead.
  3. Entry simulation: simulated entry is the open on the next trading day after t_filing, not the politician's fill price. That is what a real follower could have done.
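
The guard and the last two enforcement points can be sketched as below. The function names are hypothetical, the 20-day lookback is counted in calendar days, and the trading calendar is simplified to weekends plus a caller-supplied holiday set. A real backtester would use an exchange calendar.

```python
from datetime import date, timedelta


def assert_no_leakage(feature_times: dict[str, date],
                      t_trade: date, t_filing: date) -> None:
    """Raise if any feature became public after t_filing."""
    if t_trade > t_filing:
        raise ValueError("t_trade is after t_filing")
    late = sorted(name for name, tau in feature_times.items() if tau > t_filing)
    if late:
        raise ValueError(f"features leak past t_filing: {late}")


def feature_window(t_filing: date, lookback_days: int = 20) -> tuple[date, date]:
    """Window for rolling indicators, anchored on t_filing, not t_trade."""
    return (t_filing - timedelta(days=lookback_days), t_filing)


def next_trading_day(t_filing: date,
                     holidays: frozenset[date] = frozenset()) -> date:
    """First weekday strictly after t_filing that is not a listed holiday."""
    day = t_filing + timedelta(days=1)
    while day.weekday() >= 5 or day in holidays:
        day += timedelta(days=1)
    return day


# A filing published on Friday 2026-03-06 gets a simulated entry on Monday.
print(next_trading_day(date(2026, 3, 6)))  # 2026-03-09
```

Failing loudly on a single late feature is deliberate in this sketch: a backtest that quietly drops the feature and carries on is harder to audit than one that stops.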

fig 04 · sharpe 3.2 → 1.6
[chart: apparent Sharpe of 3.2 before fixing two leakage bugs (an Exa client caching by query string, which served future docs, and a Firestore index sorted descending but read ascending). After the fix, the honest Sharpe settles at 1.6.]
fig 04 Two leakage bugs were inflating Sharpe by ~2×. Honest backtester after the fix.
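
The first of those bugs, a search cache keyed on the query string alone, has a straightforward shape of fix: make the date cutoff part of the cache key. The sketch below is not the project's actual patch. The class, the search_fn wrapper, and the toy corpus are hypothetical.

```python
from typing import Callable

SearchFn = Callable[[str, str], list[dict]]


class PinnedSearchCache:
    """Memoize search results per (query, published_before) pair."""

    def __init__(self, search_fn: SearchFn):
        # search_fn is a hypothetical wrapper around the real search client.
        self._search_fn = search_fn
        self._cache: dict[tuple[str, str], list[dict]] = {}

    def search(self, query: str, published_before: str) -> list[dict]:
        # Keying on the query alone would replay results fetched under a
        # later cutoff, which is how future docs leak into an earlier filing.
        key = (query, published_before)
        if key not in self._cache:
            self._cache[key] = self._search_fn(query, published_before)
        return self._cache[key]


# Toy corpus and search function, for illustration only.
CORPUS = [
    {"id": "ex_001", "published_at": "2026-02-10"},
    {"id": "ex_002", "published_at": "2026-03-10"},
]


def fake_search(query: str, published_before: str) -> list[dict]:
    # ISO dates compare correctly as strings; the cutoff is inclusive here.
    return [d for d in CORPUS if d["published_at"] <= published_before]


cache = PinnedSearchCache(fake_search)
late = cache.search("NVDA datacenter demand", "2026-03-31")
early = cache.search("NVDA datacenter demand", "2026-03-03")
print([d["id"] for d in early])  # ['ex_001']
```

With a query-only key, the second call would have returned both documents, including one published a week after the filing.
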
07 results

Paper-trading results

Over a 90-day paper window (Jan to Mar 2026):

The strategy returned +12.0% vs SPY at +4.1%. Sharpe ~1.6. Hit rate 58% on 141 closed positions. Average hold 31 days, set by the thesis's horizon_days, with early exit when the counter_signal triggers. Largest drawdown was 6.4%, mostly from a cluster of accepted theses around a regional bank that were wrong about the rate-cut path.
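
The hold rule reads roughly like this in code. It is a sketch under assumptions: horizon_days is counted in calendar days, the counter_signal is reduced to the date it was observed, and the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def exit_date(entry: date, horizon_days: int,
              counter_signal_date: Optional[date] = None) -> date:
    """Planned exit at entry + horizon_days, or earlier if the
    counter-signal is observed while the position is open."""
    planned = entry + timedelta(days=horizon_days)
    if counter_signal_date is not None and entry <= counter_signal_date < planned:
        return counter_signal_date
    return planned


# Entry on 2026-03-04 with a 45-day horizon and no counter-signal.
print(exit_date(date(2026, 3, 4), 45))  # 2026-04-18
```

A counter-signal dated after the planned exit changes nothing, since the position is already closed by then.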

fig 05 · n=141 · sharpe 1.6
[chart: equity curve, y-axis from −4% to +12%, x-axis wk1 to wk13. Series: strategy and SPY benchmark, with the −6.4% drawdown marked.]
fig 05 Paper-trading equity curve vs SPY, Jan–Mar 2026 (90-day window, n=141 closed positions).

The strategy was meaningfully positive on committee-aligned trades (Armed Services members trading defense names) and roughly flat on broad-market index trades. Alpha sits in informational asymmetry, not in copying directional bets[5].

Senate trades did not outperform House trades, which is the opposite of the conventional "Pelosi Tracker" framing. In my window, House PTRs that survived the critic had slightly higher hit rates. I don't have a clean explanation. Best guess is selection, since more House filers means more independent signals after filtering.

08 reflection

What I'd do differently

Both passes are Claude, so they share priors. I want to run the critic on a non-Anthropic model and watch rejection rates shift. Position sizing is equal-weight capped at 2% NAV; a Kelly-style sizer keyed off the critic's recomputed_confidence is the obvious next step, but I don't trust my confidence calibration yet. About 4% of PTRs come through as scanned PDFs from older filers' offices and my parser drops them, which probably hides some of the most interesting trades. And 90 days isn't a backtest, it's a demo. I want at least two years of out-of-sample data before any real money goes near this, which means a historical Exa-snapshot corpus. That's its own project.
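
For the sizing point, here is a sketch of the current rule next to a fractional-Kelly alternative. The reading of "equal-weight capped at 2% NAV" as NAV / n with a 2% ceiling, the even payoff ratio, and the quarter-Kelly scale are all assumptions. Treating recomputed_confidence as a win probability is exactly the calibration question raised above.

```python
def equal_weight_size(nav: float, n_open: int, cap: float = 0.02) -> float:
    """Dollar size per position: NAV / n, capped at cap * NAV."""
    if n_open <= 0:
        raise ValueError("n_open must be positive")
    return min(nav / n_open, cap * nav)


def kelly_fraction(p_win: float, payoff_ratio: float = 1.0) -> float:
    """Kelly fraction f* = p - (1 - p) / b, floored at zero."""
    if not 0.0 <= p_win <= 1.0 or payoff_ratio <= 0:
        raise ValueError("need 0 <= p_win <= 1 and payoff_ratio > 0")
    return max(0.0, p_win - (1.0 - p_win) / payoff_ratio)


def kelly_size(nav: float, confidence: float, payoff_ratio: float = 1.0,
               scale: float = 0.25, cap: float = 0.02) -> float:
    """Fractional Kelly on the critic's confidence, still capped at cap * NAV."""
    return nav * min(cap, scale * kelly_fraction(confidence, payoff_ratio))
```

Under these assumptions a confidence of 0.62 gives a full-Kelly fraction of 0.24, so the 2% cap binds even at quarter Kelly. The sizer only starts to differentiate positions at confidences close to 0.5, which is where miscalibration hurts most.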

/ footnotes

  [1] SEC Form 4 filing requirements and the two-business-day rule: sec.gov/forms/form4data and EDGAR full-text search at efts.sec.gov.
  [2] Exa neural search with date filters and content retrieval: docs.exa.ai/reference/search.
  [3] Anthropic Claude API, JSON-mode and tool use patterns used for the thesis schema and critic loop: docs.anthropic.com/structured-outputs.
  [4] STOCK Act disclosure rules and PTR timing requirements (House and Senate): ethics.house.gov and ethics.senate.gov.
  [5] Ziobrowski et al., "Abnormal Returns from the Common Stock Investments of the U.S. Senate," Journal of Financial and Quantitative Analysis. Foundational paper on politician-trade alpha and the basis for the "informational asymmetry" framing. jstor.org.