Best F1 = 0.912 (OWASP) and 0.837 (real-world CWE-bench); +0.26 F1 from CWE-specialized prompting.
Takes raw CodeQL alerts and runs them through contextual reasoning + structured evidence validation to filter false positives. Evaluated 10 frontier LLMs across 6 model families (Gemini, GPT, Grok, Mistral, DeepSeek, Qwen) on two benchmarks: OWASP Java Benchmark (1,974 cases / 10 CWE categories) and CWE-bench, a real-world dataset of 755 CodeQL alerts across 56 project–CVE pairs from 37 open-source Java repositories. CWE-specialized prompting improved F1 by up to +0.26 on real-world code.
- LLMs
- CodeQL
- Python
- Static Analysis
- Multi-Stage Prompting
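
For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score can be computed for an alert filter. The `Verdict` type and its field names are hypothetical and are not taken from the project's code; the sketch assumes each alert has a ground-truth label and a keep/discard decision from the filter.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Verdict:
    """One triaged alert (hypothetical shape, for illustration only)."""
    is_true_vuln: bool    # ground-truth label from the benchmark
    kept_by_filter: bool  # True if the filter reported the alert as real


def f1_score(verdicts: Iterable[Verdict]) -> float:
    """F1 over the 'real vulnerability' class: harmonic mean of precision and recall."""
    verdicts = list(verdicts)
    tp = sum(v.is_true_vuln and v.kept_by_filter for v in verdicts)
    fp = sum((not v.is_true_vuln) and v.kept_by_filter for v in verdicts)
    fn = sum(v.is_true_vuln and (not v.kept_by_filter) for v in verdicts)
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A false positive that survives the filter lowers precision, and a real vulnerability that gets discarded lowers recall, so F1 penalizes both failure modes.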
LLM-powered agentic IDE. I own the multi-agent DAG orchestration, subagent system, and context-management layers.
Authored the empirical study behind Fabric’s externally published March 2026 benchmark report: 99% of frontier accuracy at 18% of frontier cost on Aider Polyglot (225+ exercises, 6 languages).
Production agentic IDE in the Cursor product space. Shipped: a six-tool subagent surface (DelegateTask / SendMessage / WaitForTask / CheckTaskOutput / StopTask / ListTasks) with headless execution, foreground/background promotion, and notification-queue injection back into LLM conversation history; a TDD-style RED→GREEN multi-agent DAG orchestrator with Mission Control dashboard; chain-of-density + KV-cache-aware summarization with unified context-budget tracking; the prepare→permission→execute tool lifecycle with path-scoped Bash/Read/Write/Edit/Glob; SWE-bench and Aider Polyglot evaluation infrastructure; and an MCP server exposing the test-and-break loop to AI agents. Also designed and ran a SWE-bench experiment comparing runs with and without GraphRAG over an 18,000-LoC code-knowledge-graph subsystem; the negative result (no measurable improvement) informed the team’s no-ship recommendation.
- TypeScript
- Electron
- React
- LLM Agents
- MCP
- SWE-bench
- Docker
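
To make the prepare→permission→execute lifecycle concrete, here is a minimal sketch of a path-scoped check. The product itself is TypeScript; Python is used here only for illustration. Every name below (`ToolCall`, `prepare`, `permitted`, `run`) is hypothetical, and the sketch assumes POSIX-style paths and does not handle symlinks.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToolCall:
    tool: str  # e.g. "Read" or "Write"
    path: str  # path argument as supplied by the model


def prepare(call: ToolCall, workspace: str) -> ToolCall:
    """Prepare step: resolve the path against the workspace and collapse '..' segments."""
    joined = posixpath.join(workspace, call.path)
    return ToolCall(call.tool, posixpath.normpath(joined))


def permitted(call: ToolCall, allowed_roots: Sequence[str]) -> bool:
    """Permission step: the normalized path must be an allowed root or sit inside one."""
    target = PurePosixPath(call.path)
    return any(
        target == PurePosixPath(root) or PurePosixPath(root) in target.parents
        for root in allowed_roots
    )


def run(call: ToolCall, workspace: str, allowed_roots: Sequence[str],
        execute: Callable[[ToolCall], str]) -> str:
    """Execute step runs only if the permission step passes."""
    prepared = prepare(call, workspace)
    if not permitted(prepared, allowed_roots):
        return f"denied: {prepared.path}"
    return execute(prepared)
```

Normalizing before the permission check is the point of the ordering: a request such as `../etc/passwd` is resolved first, so it is judged by where it actually lands and not by how it was spelled.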
89 verified end-to-end CVE completions at commit 05743f35 with 100% success across exploit / patch / diff / verification checks.
SFU lab project. Eight-phase LangGraph state machine drives the full exploit-and-patch lifecycle per CVE: parallel PoC analysis across 7 sources (GitHub, GitLab, Exploit-DB, PacketStorm, Nuclei, Metasploit, vendor advisories) + advisory enrichment, 0–10 composite PoC scoring with a synthesis fallback below threshold, parallel dockerized vulnerable + patched builds, automated exploit validation that verifies EXPLOIT_SUCCESS on vuln and EXPLOIT_FAILED on patched, and a three-layer hallucination defense at validation (filesystem-grounded verdict, fresh-context re-read, persistent audit trail).
- LangGraph
- LangChain
- Docker
- Python
- Claude Code SDK
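
The validation rule described above (exploit succeeds on the vulnerable build, fails on the patched build) can be sketched as follows. This is an illustration under assumptions, not the project's code: it assumes each run writes its marker on a line of its own, and the function names are hypothetical.

```python
from typing import Optional

SUCCESS = "EXPLOIT_SUCCESS"
FAILED = "EXPLOIT_FAILED"


def marker(log: str) -> Optional[str]:
    """Return the single marker found in a run log, or None if absent or ambiguous."""
    found = {line.strip() for line in log.splitlines()} & {SUCCESS, FAILED}
    # A log containing both markers is treated as inconclusive.
    return found.pop() if len(found) == 1 else None


def patch_verified(vuln_log: str, patched_log: str) -> bool:
    """True only if the exploit works on the vulnerable build and fails on the patched one."""
    return marker(vuln_log) == SUCCESS and marker(patched_log) == FAILED
```

Requiring both outcomes guards against two false signals: an exploit that never worked in the first place, and a patch that does not actually block it.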
Iran’s leading crypto social-trading platform — ~40k users in 18 months on a 24/7 financial system.
Co-founded the company and architected the trading engine: smart order routing across 5+ exchanges (Binance, KuCoin, regional venues), best-execution price aggregation over a consolidated best-bid/best-ask view, a per-exchange adapter pattern over a normalized internal schema, async Python + Celery, sub-second cross-exchange price-refresh fan-out via Redis Pub/Sub, and an idempotent copy-replication state machine (Copycat) with slippage controls and per-follower position sizing, tracking low thousands of active leader-follower pairs at peak. Shipped the MVP in ~2 months; the platform reached ~40k users in 18 months.
- Python
- Django
- PostgreSQL
- Celery
- Redis
- asyncio
- WebSocket
- Docker
- Real-time Systems
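
A consolidated best-bid/best-ask view can be illustrated with a small sketch. The `Quote` type and the routing rule are assumptions for illustration, not the production engine: the sketch ignores fees, order-book depth, and latency, all of which a real router would need to weigh.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Quote:
    """Top-of-book quote in a normalized schema (hypothetical shape)."""
    exchange: str
    bid: Decimal  # highest price a buyer on this venue will pay
    ask: Decimal  # lowest price a seller on this venue will accept


def best_bid_ask(quotes: Sequence[Quote]) -> Tuple[Quote, Quote]:
    """Return (quote with the highest bid, quote with the lowest ask)."""
    if not quotes:
        raise ValueError("no quotes available")
    return max(quotes, key=lambda q: q.bid), min(quotes, key=lambda q: q.ask)


def route(side: str, quotes: Sequence[Quote]) -> str:
    """Pick a venue: buys go to the lowest ask, sells go to the highest bid."""
    if side not in ("buy", "sell"):
        raise ValueError(f"unknown side: {side}")
    best_bid, best_ask = best_bid_ask(quotes)
    return best_ask.exchange if side == "buy" else best_bid.exchange
```

Using `Decimal` instead of `float` avoids binary rounding when prices from different venues are compared.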
SnappFood — ETA, Churn, Fraud Models (10M+ users)
Production ML on Iran’s largest food-delivery platform: ~27% better ETA, 13% lower churn, 10% CSAT lift.
~27% ETA accuracy improvement, 13% churn reduction, 10% CSAT lift — measured on 10M+ users.
Customer Experience team — built the Octopus BI layer (department-specific KPI dashboards), adapted Uber’s DeepETA to motorbike delivery for ~27% ETA-accuracy improvement and 24% fewer delivery delays, shipped a churn-prediction pipeline (RFM features + logistic regression on 3M+ users) that fed reactivation campaigns dropping monthly churn by 13%, and a vendor-fraud detection system that lifted CSAT by 10% and NPS from 5 to 7.
- Python
- PyTorch
- Keras
- scikit-learn
- SQL
- Power BI
- Pandas
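
The RFM (recency, frequency, monetary) features mentioned above can be derived from raw order records roughly as follows. This is a standard-library sketch with a hypothetical input shape; the production pipeline and its feature definitions may differ, and the logistic-regression step is not shown.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Order = Tuple[str, date, float]  # (user_id, order_date, amount), assumed shape


def rfm_features(orders: Iterable[Order], as_of: date) -> Dict[str, Tuple[int, int, float]]:
    """Map each user to (days since last order, order count, total spend) as of a date."""
    last_order: Dict[str, date] = {}
    frequency: Dict[str, int] = defaultdict(int)
    monetary: Dict[str, float] = defaultdict(float)
    for user, day, amount in orders:
        if day > as_of:
            continue  # ignore orders after the snapshot date to avoid label leakage
        last_order[user] = max(last_order.get(user, day), day)
        frequency[user] += 1
        monetary[user] += amount
    return {
        user: ((as_of - last_order[user]).days, frequency[user], monetary[user])
        for user in last_order
    }
```

Each user's three values would then form the input row for a churn classifier.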
End-to-end voice-to-prompt desktop agent shipped in a single working commit (Tauri 2, dual-path Whisper).
Personal project. Tauri 2 macOS app (~2,460 LOC Rust + TypeScript/Svelte, 5 MB bundle) with global-hotkey audio capture, dual-path Whisper (OpenAI Whisper API + local whisper.cpp via whisper-rs with 5 GGML model variants), Claude Haiku prompt structuring with shallow project-context injection (CLAUDE.md / README.md / package.json), and auto-paste via osascript. Five-phase state machine: idle → recording → transcribing → structuring → pasting with live UI feedback. Planned upgrade: tree-sitter + tantivy symbol index and a two-stage grounded rewrite with deterministic Levenshtein identifier guard.
- Rust
- Tauri
- Svelte
- Whisper
- whisper.cpp
- Anthropic SDK
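
The five-phase flow can be modeled as a small transition table. The app itself is Rust and TypeScript/Svelte, so this Python sketch is illustrative only. It assumes the app returns to idle after pasting and omits error and cancellation paths; neither detail is stated above.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "idle"
    RECORDING = "recording"
    TRANSCRIBING = "transcribing"
    STRUCTURING = "structuring"
    PASTING = "pasting"


# Linear happy path; the PASTING -> IDLE edge is an assumption.
NEXT_PHASE = {
    Phase.IDLE: Phase.RECORDING,
    Phase.RECORDING: Phase.TRANSCRIBING,
    Phase.TRANSCRIBING: Phase.STRUCTURING,
    Phase.STRUCTURING: Phase.PASTING,
    Phase.PASTING: Phase.IDLE,
}


def advance(phase: Phase) -> Phase:
    """Move to the next phase on the happy path."""
    return NEXT_PHASE[phase]
```

Keeping the transitions in one table means the UI can render the current phase directly and an unexpected jump is easy to detect.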