Tools·Jun 18, 2026

Qwen 0.8B sets local summarization floor on 6GB GPU; Granite 350M hallucinates

We evaluate Qwen 0.8B and Granite 350M for local meeting summarization on a 6GB GPU, detailing performance, VRAM usage, and critical context window adjustments for practical use. The Answer Up Front…

By Riley · Tools desk·Human-reviewed·✓ Verified Jun 18, 2026·4 min read·1 source

We evaluate Qwen 0.8B and Granite 350M for local meeting summarization on a 6GB GPU, detailing performance, VRAM usage, and critical context window adjustments for practical use.

The Answer Up Front

For developers with a 6GB GPU seeking local meeting summarization, Qwen 3.5:0.8b offers a viable, albeit slow, solution. It successfully produces structured summaries with a Modelfile adjustment to num_ctx. Skip Granite 4.0 350M for this workload; despite its speed, it consistently hallucinates, rendering its output unusable for factual summarization. The bottom line is that sub-1B models can deliver coherent structured summaries locally, but careful model selection and context management are critical.

Methodology

v0 review draws on the founder raww2222's published claims on Reddit; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

This review covers the performance of two small language models—Qwen 3.5:0.8b and Granite 4.0 350M—for local meeting summarization. The testing was conducted by founder raww2222 using a custom application designed for local dictation, transcription, and summarization. The specific test rig was an RTX 3060 Laptop with 6GB VRAM, running Ollama 0.23 on Arch Linux. Approximately 4.3GB of VRAM was available after faster-whisper loaded. The input for summarization was a real 4-minute meeting transcript, approximately 2900 characters long. A key part of the methodology involved a Modelfile adjustment for Qwen 3.5:0.8b, explicitly setting PARAMETER num_ctx 16384 to overcome Ollama's default VRAM-aware context limit of 4096 tokens. This review focuses on the founder's reported inference times, VRAM consumption, and output quality for structured summaries. It does not cover independent performance verification, long-term workflow integration, or edge cases beyond the described meeting transcripts.

What It Does

Local-first Audio Processing

Founder raww2222 developed an open-source application, available as a single .exe on Windows and .AppImage on Linux, to keep audio processing entirely local. The tool provides hotkey-triggered dictation using faster-whisper for local transcription, pasting text directly at the cursor. This addresses privacy concerns by ensuring audio data does not leave the machine for transcription.

Meeting Recording and Summarization

Version 1.6.0 of the application introduced a meetings recorder, capturing both microphone and system audio into a single stereo file. This file is then transcribed locally. For summarization, the application allows users to point to an arbitrary endpoint, including local Ollama or llama.cpp instances, or cloud services like Groq and OpenAI. The only network call is for the optional summary, with the user controlling the destination.

Small Model Benchmarking

The founder used this application as a testbed for evaluating mini-models on real-world meeting transcripts. The primary focus was on qwen3.5:0.8b (873M, Q8_0) and Granite 4.0 350M. The goal was to identify a working floor for coherent, structured summarization on a 6GB GPU, specifically looking for models that could produce TL;DRs, decisions, action items, and open questions without excessive VRAM or hallucination.

What's Interesting / What's Not

The most interesting finding is the demonstrated capability of qwen3.5:0.8b to produce a coherent, structured summary from a 2900-character meeting transcript on a 6GB GPU. The founder reports it streamed a 1562-character summary in 57 seconds, consuming 2.2GB of VRAM. This performance, while not fast, establishes a practical baseline for local, low-VRAM summarization. The critical insight here is the Modelfile fix: explicitly setting PARAMETER num_ctx 16384 was necessary to prevent Ollama's VRAM-aware defaults from limiting the context window, which would otherwise lead to truncated or incomplete reasoning.

In stark contrast, Granite 4.0 350M proved unsuitable for this task. Despite its superior speed, completing summaries in 0.6 to 2.8 seconds, it exhibited severe hallucination. On a transcript about Anthropic acquiring Bun, Granite invented

The investor read

The demand for local-first AI solutions, especially for sensitive data like meeting transcripts, signals a growing market for edge computing and privacy-preserving tools. This review highlights the challenges and successes of running small LLMs on consumer-grade hardware, a key enabler for this trend. The clear distinction between a usable (Qwen 0.8B) and an unusable (Granite 350M) sub-1B model underscores the importance of quality over sheer size in the small model space. Investors should watch for specialized small models that prioritize coherence and structured output, even at the cost of speed, for specific tasks like summarization. The founder's application, while open-source, demonstrates a clear product need and could inform future commercial ventures focusing on local AI agents. The name collision with voiceflow.com is a minor branding risk for any future commercialization.

Sources · how we verified

Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates ↗

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.

Riley

The Riley desk covers tools — what founders are building with, switching to, and abandoning. Every claim is sourced and linked. Operated by Founderr (RIKHATH LLC) See the desk →

The Answer Up Front

Methodology

What It Does

Local-first Audio Processing

Meeting Recording and Summarization

Small Model Benchmarking

What's Interesting / What's Not

The investor read

Google Sheets as a $0/month contact form backend for static sites

Rust GUI memory reduction: Ditching GPU for CPU rendering

Leptos and WASM for Micro-SaaS: A Performance-Focused Review