Hi, I'm

Mohsen Iranmanesh

M.Sc. Computing Science @ SFU · Research Engineer Intern @ Farpoint

I build agentic LLM systems that ship — and write papers about them. Currently working on the agent layer of an LLM-powered IDE at Farpoint, and on LLM-driven static analysis and CVE reproduction under Dr. Mohammad Tayebi at SFU.

Research

I work on agentic AI systems for software engineering — specifically, on pipelines that combine LLMs with static-analysis tooling for vulnerability detection, triage, and remediation. M.Sc. Computing Science at SFU under Dr. Mohammad Tayebi.

ZeroFalse: Improving Precision in Static Analysis with LLMs

Under review · RAID 2026

Mohsen Iranmanesh, Sina Moradi Sabet, Sina Marefat, Ali Javidi Ghasr, Allison Wilson, Iman Sharafaldin, Mohammad A. Tayebi

A multi-stage LLM pipeline that takes raw static-analyzer alerts and triages them through contextual reasoning and structured evidence validation, reducing false positives without sacrificing recall. Evaluated 10 frontier LLMs across 6 model families (Gemini, GPT, Grok, Mistral, DeepSeek, Qwen) on the OWASP Java Benchmark (1,974 cases / 10 CWE categories) and CWE-bench — a real-world dataset of 755 CodeQL alerts across 56 project–CVE pairs from 37 open-source Java repositories. CWE-specialized prompting improved F1 by up to +0.26 on real-world code; best F1 is 0.912 on OWASP and 0.837 on CWE-bench.

First-author submission, currently under review at RAID 2026.

Full list and project pages on Research.


Selected work

Engineering projects from internship work, founding-team builds, and personal experiments. A two-layer read: headline claims for fast scans, and a click into each project for architecture, tradeoffs, and evals.

ZeroFalse

Multi-stage LLM pipeline that reduces false positives in static analysis.

Best F1 = 0.912 (OWASP) and 0.837 (real-world CWE-bench); +0.26 F1 from CWE-specialized prompting.

Takes raw CodeQL alerts and runs them through contextual reasoning + structured evidence validation to filter false positives. Evaluated 10 frontier LLMs across 6 model families (Gemini, GPT, Grok, Mistral, DeepSeek, Qwen) on two benchmarks: OWASP Java Benchmark (1,974 cases / 10 CWE categories) and CWE-bench, a real-world dataset of 755 CodeQL alerts across 56 project–CVE pairs from 37 open-source Java repositories. CWE-specialized prompting improved F1 by up to +0.26 on real-world code.

  • LLMs
  • CodeQL
  • Python
  • Static Analysis
  • Multi-Stage Prompting
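
The staged triage described above can be sketched as follows. The LLM calls are stubbed with stand-in heuristics; the `Alert` shape and the function names are illustrative, not the actual ZeroFalse code:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    cwe: str          # e.g. "CWE-89" (SQL injection)
    snippet: str      # flagged source region
    flow: list[str]   # taint path reported by the analyzer

def contextual_reasoning(alert: Alert) -> bool:
    """Stage 1 (stub): an LLM judges whether the flagged flow is
    reachable and attacker-controlled in context."""
    # Stand-in heuristic: a flow that never reaches a sink is benign.
    return any("sink" in step for step in alert.flow)

def evidence_validation(alert: Alert) -> bool:
    """Stage 2 (stub): the verdict must be backed by concrete evidence
    from the code; unsupported verdicts are rejected."""
    return "sanitize" not in alert.snippet

def triage(alert: Alert) -> str:
    # An alert survives only if both stages independently confirm it,
    # which is what suppresses false positives without discarding
    # clearly evidenced true positives.
    if contextual_reasoning(alert) and evidence_validation(alert):
        return "true-positive"
    return "false-positive"
```

The two-stage AND is the point: either stage alone can veto an alert, so precision improves while confirmed findings still pass.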

Fabric — Agentic IDE (Farpoint)

LLM-powered agentic IDE. I own the multi-agent DAG orchestration, subagent system, and context-management layers.

Authored the empirical study behind Fabric’s externally published March 2026 benchmark report: 99% of frontier accuracy at 18% of frontier cost on Aider Polyglot (225+ exercises, 6 languages).

Production agentic IDE in the Cursor product space. Shipped: a six-tool subagent surface (DelegateTask / SendMessage / WaitForTask / CheckTaskOutput / StopTask / ListTasks) with headless execution, foreground/background promotion, and notification-queue injection back into LLM conversation history; a TDD-style RED→GREEN multi-agent DAG orchestrator with a Mission Control dashboard; chain-of-density + KV-cache-aware summarization with unified context-budget tracking; the prepare→permission→execute tool lifecycle with path-scoped Bash/Read/Write/Edit/Glob; SWE-bench and Aider Polyglot evaluation infrastructure; and an MCP server exposing the test-and-break loop to AI agents. Also designed and ran a with-vs-without-GraphRAG SWE-bench experiment over an 18,000-LoC code-knowledge-graph subsystem; the negative result (no measurable improvement) informed the team’s no-ship recommendation.

  • TypeScript
  • Electron
  • React
  • LLM Agents
  • MCP
  • SWE-Bench
  • Docker

Golden Repository — Verified, Executable CVE Reproductions

LangGraph-orchestrated agentic pipeline that reproduces and patches CVEs end-to-end. 89 verified completions (61 Python + 28 Java).

89 verified end-to-end CVE completions at commit 05743f35 with 100% success across exploit / patch / diff / verification checks.

SFU lab project. An eight-phase LangGraph state machine drives the full exploit-and-patch lifecycle per CVE: parallel PoC analysis across 7 sources (GitHub, GitLab, Exploit-DB, PacketStorm, Nuclei, Metasploit, vendor advisories) plus advisory enrichment; 0–10 composite PoC scoring with a synthesis fallback below threshold; parallel dockerized vulnerable and patched builds; automated exploit validation that verifies EXPLOIT_SUCCESS on the vulnerable build and EXPLOIT_FAILED on the patched one; and a three-layer hallucination defense at validation (filesystem-grounded verdict, fresh-context re-read, persistent audit trail).

  • LangGraph
  • LangChain
  • Docker
  • Python
  • Claude Code SDK
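
The PoC-scoring fallback and the two-sided validation step can be sketched as follows. The names `select_poc` and `validate_reproduction` are illustrative, and `run_exploit` stands in for the dockerized exploit runner:

```python
def select_poc(scored_pocs: list[tuple[str, float]], threshold: float = 6.0) -> str:
    """Pick the best-scoring PoC on the 0-10 composite scale, or fall
    back to synthesizing one when nothing clears the threshold."""
    best = max(scored_pocs, key=lambda p: p[1], default=(None, -1.0))
    return best[0] if best[1] >= threshold else "SYNTHESIZE"

def validate_reproduction(run_exploit, vuln_target: str, patched_target: str) -> bool:
    """A reproduction is verified only if the exploit demonstrably works
    on the vulnerable build AND demonstrably fails on the patched one;
    either leg alone is insufficient evidence."""
    return (run_exploit(vuln_target) == "EXPLOIT_SUCCESS"
            and run_exploit(patched_target) == "EXPLOIT_FAILED")
```

Requiring both legs is what makes each completion a differential test of the patch, not just a demo of the exploit.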

Pabla — Crypto Social-Trading Engine

Real-time copy-trading engine for crypto markets. Iran’s leading platform in the space — ~40k users in 18 months.

Iran’s leading crypto social-trading platform — ~40k users in 18 months on a 24/7 financial system.

Co-founded the company and architected the trading engine: smart order routing across 5+ exchanges (Binance, KuCoin, regional venues); best-execution price aggregation over a consolidated best-bid/best-ask view; a per-exchange adapter pattern over a normalized internal schema; async Python + Celery workers; sub-second cross-exchange price-refresh fan-out via Redis Pub/Sub; and an idempotent copy-replication state machine (Copycat) with slippage controls and per-follower position sizing, tracking low thousands of active leader-follower pairs at peak. Shipped the MVP in ~2 months; the platform reached ~40k users in 18 months.

  • Python
  • Django
  • PostgreSQL
  • Celery
  • Redis
  • asyncio
  • WebSocket
  • Docker
  • Real-time Systems
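
The consolidated best-bid/best-ask view reduces to a max/min over per-exchange top-of-book quotes. A minimal sketch, assuming a simplified quote shape (the `consolidate_books` name and schema are illustrative, not the production adapters):

```python
def consolidate_books(books: dict[str, dict[str, float]]) -> dict[str, tuple[str, float]]:
    """Given per-exchange top-of-book quotes {exchange: {"bid": x, "ask": y}},
    return the consolidated best bid (highest) and best ask (lowest),
    each tagged with the venue an order should be routed to."""
    best_bid = max(books.items(), key=lambda kv: kv[1]["bid"])
    best_ask = min(books.items(), key=lambda kv: kv[1]["ask"])
    return {
        "bid": (best_bid[0], best_bid[1]["bid"]),   # route sells here
        "ask": (best_ask[0], best_ask[1]["ask"]),   # route buys here
    }
```

In production the same view would be refreshed per symbol on every Pub/Sub tick; the routing decision itself stays this simple.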

SnappFood — ETA, Churn, Fraud Models (10M+ users)

Production ML on Iran’s largest food-delivery platform: ~27% better ETA, 13% lower churn, 10% CSAT lift.

~27% ETA accuracy improvement, 13% churn reduction, 10% CSAT lift — measured on 10M+ users.

Customer Experience team — built the Octopus BI layer (department-specific KPI dashboards); adapted Uber’s DeepETA to motorbike delivery for a ~27% ETA-accuracy improvement and 24% fewer delivery delays; shipped a churn-prediction pipeline (RFM features + logistic regression on 3M+ users) that fed reactivation campaigns, dropping monthly churn by 13%; and built a vendor-fraud detection system that lifted CSAT by 10% and NPS from 5 to 7.

  • Python
  • PyTorch
  • Keras
  • scikit-learn
  • SQL
  • Power BI
  • Pandas
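
The RFM feature triple behind the churn model is simple to state. A sketch assuming per-customer order dates and amounts are available (the `rfm_features` helper is illustrative; the production pipeline computed these at scale in SQL/Pandas):

```python
from datetime import date

def rfm_features(order_dates: list[date], amounts: list[float], today: date) -> dict[str, float]:
    """Recency (days since last order), Frequency (order count), and
    Monetary (total spend): the feature triple fed, alongside others,
    to a logistic-regression churn classifier."""
    return {
        "recency": (today - max(order_dates)).days,
        "frequency": len(order_dates),
        "monetary": sum(amounts),
    }
```

High recency with previously high frequency is the classic churn signature these features let a linear model pick up.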

Clarion — Voice-to-Prompt Desktop Agent

Tauri 2 macOS menu-bar agent: hotkey → Whisper → Haiku rewrite → paste. Built for bilingual developers.

End-to-end voice-to-prompt desktop agent shipped in a single working commit (Tauri 2, dual-path Whisper).

Personal project. Tauri 2 macOS app (~2,460 LOC Rust + TypeScript/Svelte, 5 MB bundle) with global-hotkey audio capture, dual-path Whisper (OpenAI Whisper API + local whisper.cpp via whisper-rs with 5 GGML model variants), Claude Haiku prompt structuring with shallow project-context injection (CLAUDE.md / README.md / package.json), and auto-paste via osascript. Five-phase state machine: idle → recording → transcribing → structuring → pasting with live UI feedback. Planned upgrade: tree-sitter + tantivy symbol index and a two-stage grounded rewrite with deterministic Levenshtein identifier guard.

  • Rust
  • Tauri
  • Svelte
  • Whisper
  • whisper.cpp
  • Anthropic SDK
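
The five-phase lifecycle is a small transition table. A sketch (the `TRANSITIONS`/`advance` names are illustrative, and the error/cancel paths back to idle are an assumption about app behavior, not taken from the code):

```python
# Legal transitions of the capture pipeline; anything outside the table
# is a bug (e.g. a paste firing while audio is still being transcribed).
TRANSITIONS = {
    "idle": {"recording"},
    "recording": {"transcribing", "idle"},    # back to idle = cancelled
    "transcribing": {"structuring", "idle"},  # back to idle = STT error
    "structuring": {"pasting", "idle"},       # back to idle = LLM error
    "pasting": {"idle"},
}

def advance(state: str, target: str) -> str:
    """Move the UI state machine one step, rejecting illegal jumps."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

Centralizing legality in one table is what keeps the live UI feedback honest: every spinner and status label is derived from a state the machine actually permitted.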

All projects — including research infrastructure and smaller experiments — on Projects.


About

I'm a software & ML engineer with a research line. Eight years of production engineering across fintech, food-delivery (10M+ users), and developer tools — and a publication track in applied LLM systems for software security. Right now I'm at Farpoint, where I own the multi-agent DAG orchestration, subagent system, and context-management layers of an LLM-powered agentic IDE, and at SFU where I'm finishing my M.Sc. thesis on LLM-driven vulnerability remediation.

I'm targeting full-time AI Engineer / ML Engineer / SWE / AI Researcher / ML Researcher roles starting in late 2026. If you're building agentic systems, LLM evaluation infrastructure, code-intelligence tooling, or research-product engineering — I'd love to chat. Email is the fastest way.