Multimodal Search Stack: Performance for Indie Projects with Modal and L40S
This review analyzes the real-world performance of a multimodal semantic search stack, focusing on Modal, L40S, and Qwen3-VL-Embedding for cost-sensitive, scale-to-zero indie projects.
TL;DR
- Best for: Indie projects requiring multimodal semantic search over moderately sized datasets (e.g., up to 70k items) where scale-to-zero economics are critical and cold start latency can be managed through active warming.
- Skip if: Your application demands sub-second cold start times without external warming mechanisms, or if you require significantly larger vector dimensions and cannot tolerate increased search latency.
- Bottom line: The combination of Modal, L40S, and Qwen3-VL-Embedding provides a performant and cost-effective foundation for multimodal search, particularly when paired with Cloudflare R2's zero-egress pricing.
METHODOLOGY
This v0 review draws on the founder's published claims and detailed performance metrics shared on Reddit. Independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.
The stack under review comprises several key components: Modal (for serverless GPU serving), an NVIDIA L40S GPU (the hardware accelerator), Qwen3-VL-Embedding-2B (the multimodal embedder), FAISS (for vector indexing), Cloudflare R2 (for storage), and a Next.js frontend on Vercel. The review covers the performance observed by the founder, fm23moemoe, for a multimodal semantic search application over 68,816 artworks from the National Gallery of Art's open-access collection. Data was observed on 2026-06-04.
Specifically, we cover the reported cold and warm start times, server-side latency for embedding and FAISS search, vector dimension tradeoffs, and the impact of Cloudflare R2's pricing model. This initial review does not cover independent verification of the performance claims, long-term workflow integration challenges, or edge cases beyond the NGA dataset. The founder is a UK-based master's student with a finance background, so the account comes from outside traditional CV engineering and emphasizes practical lessons learned.
WHAT IT DOES
This project implements a multimodal semantic search engine, allowing users to query 68,816 artworks using either text prompts (e.g., "a Vermeer with afternoon light") or image uploads. It returns 64 visually similar results, ranked by multimodal similarity.
Multimodal embedding with Qwen3-VL-Embedding
The core of the system is the Qwen3-VL-Embedding-2B model, which converts both text and image inputs into a single vector representation. The founder uses a 1024-dimension vector for indexing, derived from a 2048-dimension base, balancing semantic richness with performance considerations.
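The source does not say how the 1024-dimension vector is derived from the 2048-dimension base. One common approach for models trained to support it is Matryoshka-style truncation: keep the leading components and re-normalize. The sketch below illustrates that approach under this assumption; the function name `truncate_and_normalize` and the random stand-in data are hypothetical and are not the founder's code.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, target_dim: int) -> np.ndarray:
    """Keep the leading target_dim components, then rescale rows to unit length."""
    if embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n, dim)")
    if not 0 < target_dim <= embeddings.shape[1]:
        raise ValueError("target_dim must be between 1 and the base dimension")
    truncated = embeddings[:, :target_dim].astype(np.float32)
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against all-zero rows so the division never produces NaN.
    return truncated / np.maximum(norms, 1e-12)


# Random stand-ins for 2048-dimension model outputs (assumption: not real embeddings).
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 2048)).astype(np.float32)
reduced = truncate_and_normalize(base, 1024)
print(reduced.shape)  # (4, 1024)
```

Re-normalizing after truncation matters because the index described in this review relies on unit-length vectors for its inner product scores.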
Efficient vector indexing using FAISS
Vector search is handled by FAISS, utilizing a flat inner product on L2-normalized vectors. The index size is approximately 250MB on disk, enabling rapid similarity searches over the 68,816 artwork embeddings. Server-side FAISS latency is reported between 0.9ms and 2.4ms per search.
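A flat inner product index scores the query against every stored vector, and on L2-normalized vectors that score equals cosine similarity. The following is a minimal NumPy sketch of the same computation, not the FAISS API itself; the helper names and the random data are hypothetical. It also estimates raw index size under the assumption of float32 storage, which gives about 282 MB (269 MiB) for 68,816 vectors of 1024 dimensions, the same order as the reported figure.

```python
import numpy as np


def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a flat index: one value per component, no compression."""
    return num_vectors * dim * bytes_per_value


def top_k_inner_product(index_vectors: np.ndarray, query: np.ndarray, k: int):
    """Exhaustive search: score every vector and return the k best (scores, ids)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = index_vectors @ query
    k = min(k, scores.shape[0])
    # argpartition selects the k largest scores without sorting the whole array.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return scores[order], order


# Random unit vectors as stand-ins for artwork embeddings (assumption).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Querying with a stored vector should return that vector first.
scores, ids = top_k_inner_product(vectors, vectors[3], k=5)
print(ids[0])  # 3

print(flat_index_bytes(68_816, 1024))  # 281870336
```

Exhaustive search is exact, and at this collection size the reported per-search latency suggests it is fast enough that an approximate index is not required.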
Serverless GPU serving via Modal
The backend runs on Modal, using an L40S GPU for inference. Modal's scale-to-zero capability keeps costs manageable for a side project. The founder reports warm request times of 0.4-2.0s end-to-end (mean 1.3s, median 1.45s), of which server-side processing (embedding plus FAISS search) accounts for only 35-41ms. Cold starts are the significant cost: the first request after a cold start takes 43.8s end-to-end.
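The reported figures imply that GPU-side work is a small part of a warm request. The sketch below bounds that share by pairing the extremes of the two reported ranges, which is an assumption, since the source does not give per-request pairs. The function name `share_bounds` is hypothetical.

```python
def share_bounds(server_ms_range, end_to_end_s_range):
    """Bound the fraction of end-to-end time spent in server-side processing.

    Lowest share: fastest server time over slowest end-to-end time.
    Highest share: slowest server time over fastest end-to-end time.
    """
    server_lo, server_hi = server_ms_range
    e2e_lo, e2e_hi = end_to_end_s_range
    if min(server_lo, server_hi, e2e_lo, e2e_hi) <= 0:
        raise ValueError("all latencies must be positive")
    lowest = server_lo / (e2e_hi * 1000.0)
    highest = server_hi / (e2e_lo * 1000.0)
    return lowest, highest


# Reported ranges: 35-41 ms server-side, 0.4-2.0 s end-to-end (warm).
low, high = share_bounds((35, 41), (0.4, 2.0))
print(f"server-side share: {low:.1%} to {high:.1%}")
```

Under these assumptions the server-side share falls between roughly 2% and 10%, so most warm latency would sit outside embedding and search, for example in network transfer and request handling.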
Cost-effective storage with Cloudflare R2
Metadata, the FAISS index, and a thumbnail cache are stored in Cloudflare R2. The founder explicitly states that R2's zero-egress pricing is a primary enabler for the project's viability, allowing for cost-effective data storage and retrieval without punitive bandwidth fees.
WHAT'S INTERESTING / WHAT'S NOT
What's interesting here is the founder's detailed, transparent reporting of real-world performance metrics and the practical strategies employed to mitigate common challenges in serverless GPU deployments. The explicit acknowledgment of cold start as "the real enemy" and the solution of using a Vercel cron ping every 7 minutes to keep a container warm is a valuable, actionable insight for other developers. This directly addresses a known pain point for scale-to-zero services. The specific breakdown of server-side latency (34-39ms for the embedder, 0.9-2.4ms for FAISS) provides a clear picture of where compute time is spent. The use of IIIF endpoints for on-demand, normalized thumbnails is a smart optimization, avoiding the need to pre-process and store multiple image sizes. The founder's background (finance, not CV engineering) lends weight to the "hard-won lessons," suggesting these are practical, not theoretical, findings.
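On-demand thumbnails work because the IIIF Image API encodes region, size, rotation, quality, and format in the URL path, so the image server does the resizing. The sketch below builds such a URL using the standard path layout; the base URL and identifier are placeholders, not the National Gallery of Art's actual endpoint, and the function name is hypothetical.

```python
from urllib.parse import quote


def iiif_thumbnail_url(base_url: str, identifier: str, max_side: int = 400) -> str:
    """Build an IIIF Image API URL for a thumbnail that fits in a square box.

    Path layout: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The size form "!w,h" asks for the largest image that fits within w by h
    while preserving aspect ratio.
    """
    if max_side <= 0:
        raise ValueError("max_side must be positive")
    encoded = quote(identifier, safe="")  # identifiers must be URL-encoded
    size = f"!{max_side},{max_side}"
    return f"{base_url.rstrip('/')}/{encoded}/full/{size}/0/default.jpg"


# Placeholder server and identifier (assumptions for illustration only).
print(iiif_thumbnail_url("https://iiif.example.org/iiif", "artwork-0001", 400))
# https://iiif.example.org/iiif/artwork-0001/full/!400,400/0/default.jpg
```

Because the size is a URL parameter, a frontend can request different thumbnail sizes for grid and detail views without storing extra derivatives.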
What's missing from the founder's pitch is a deeper technical rationale for choosing Qwen3-VL-Embedding-2B over other multimodal models. The choice of a 1024-dimension vector is explained by performance tradeoffs, but a comparison of recall quality at higher and lower dimensions would add value. Pagination on a flat index is described as "awkward" without detail on the specific limitations encountered or the alternatives considered. The FAISS version is not mentioned, which limits reproducibility and comparison with other benchmarks.
PRICING
- Modal: Runs on an L40S GPU with scale-to-zero, so costs are incurred only while a container is active. The source does not give specific L40S pricing on Modal; that the founder runs this as a side project suggests the cost is manageable. The founder notes an 8-minute scaledown window, which the 7-minute cron ping fits inside.
- Cloudflare R2: Zero-egress pricing is highlighted as critical for project viability. This means no charges for data transfer out of R2.
- Vercel: Used for frontend hosting (Next.js) and cron jobs to keep Modal containers warm. Vercel has a free tier for hobby projects, with paid tiers for higher usage.
Pricing snapshot: 2026-06-04.
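The cost effect of the 7-minute ping against the 8-minute scaledown window can be estimated with a simple model. The sketch assumes each ping resets the idle timer, that a container stays up for the full window after each ping, and that request and boot time are negligible. The hourly rate is a hypothetical placeholder, since the source gives no L40S price, and the function names are illustrative.

```python
def warm_fraction(ping_interval_min: float, scaledown_window_min: float) -> float:
    """Fraction of time a container is up under periodic keep-warm pings.

    If pings arrive at least as often as the scaledown window, the container
    never idles out. Otherwise it is up for one window per ping interval.
    """
    if ping_interval_min <= 0 or scaledown_window_min <= 0:
        raise ValueError("interval and window must be positive")
    return min(1.0, scaledown_window_min / ping_interval_min)


def daily_warm_hours(ping_interval_min: float, scaledown_window_min: float) -> float:
    """Hours per day the container is up if the cron runs around the clock."""
    return 24.0 * warm_fraction(ping_interval_min, scaledown_window_min)


def daily_cost(hourly_rate: float, ping_interval_min: float,
               scaledown_window_min: float) -> float:
    """Daily GPU cost for a caller-supplied (hypothetical) hourly rate."""
    return hourly_rate * daily_warm_hours(ping_interval_min, scaledown_window_min)


print(warm_fraction(7, 8))     # 1.0
print(daily_warm_hours(7, 8))  # 24.0
print(warm_fraction(16, 8))    # 0.5
```

Under this model, a 7-minute ping inside an 8-minute window keeps a container up continuously while the cron runs, so scale-to-zero savings would come only from hours when the ping is paused.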
VERDICT
This multimodal search stack, built around Modal, L40S, and Qwen3-VL-Embedding, offers a compelling solution for indie developers and side projects. It delivers strong warm-start performance (0.4-2.0s end-to-end) and highly efficient server-side processing (35-41ms), making it suitable for applications where user experience after the initial load is paramount. The critical factor for adoption is managing the 43.8s cold start latency. For projects that can implement active warming strategies, such as the Vercel cron ping described, this stack becomes highly viable. Cloudflare R2's zero-egress pricing further enhances its appeal for cost-sensitive deployments. We recommend this stack for projects with datasets up to ~70k items that prioritize operational cost efficiency and can tolerate or mitigate cold start delays.
WHAT WE'D TEST NEXT
Our next steps would involve independently verifying the reported cold and warm start times, as well as the server-side latency figures, across a range of query complexities and image types. We would also benchmark the stack's performance with significantly larger datasets, scaling beyond 68,816 artworks, to understand its behavior under increased load and index size. A comparative analysis of Qwen3-VL-Embedding-2B against other leading multimodal embedders (e.g., CLIP, various open-source alternatives) would be crucial, evaluating both embedding quality (recall) and inference latency. Finally, we would investigate the long-term cost implications of the Vercel cron ping strategy for maintaining warm containers under varying traffic patterns, alongside exploring alternative cold-start mitigation techniques.
Every claim ties to a primary source. See our methodology.