Qwen3.6:27b Local Evaluation
Hardware: NVIDIA RTX 3090 (24GB) + AMD Ryzen 3700X + 64GB RAM
Quantization: Q5_K_M
Ollama model: qwen3.6:27b
Executive Summary
Qwen3.6:27b at Q5_K_M is a capable local model on a 3090, running at 35 tok/s when the GPU is not contested. It excels at reasoning, creative writing, and factual knowledge. The built-in chain-of-thought (thinking) phase adds overhead on trivial prompts but produces excellent depth on complex ones.
Performance
| Metric | Before (contested GPU) | After (clean GPU) |
|---|---|---|
| VRAM usage | ~25 GiB (4 layers offloaded to CPU) | ~23.8 GiB (fully on GPU) |
| Throughput | ~10 tok/s | 35 tok/s |
| Model load | cold each time | warm (keep_alive) |
Reasoning & Logic — Perfect ✅
Bat & Ball ($1.10)
Prompt: "A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost? Think step by step."
Result: $0.05 with full algebraic breakdown. Explicitly noted the common intuitive trap ($0.10) and verified the answer.
- 1,648 tokens, 46.8s, 35.2 tok/s
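The algebra the model walked through can be sketched in a few lines (a minimal check, not the model's own output): from ball + bat = 1.10 and bat = ball + 1.00, it follows that 2·ball + 1.00 = 1.10.

```python
# Sanity check of the bat-and-ball arithmetic: ball + bat = total,
# bat = ball + difference, hence 2*ball + difference = total.
def solve_bat_and_ball(total=1.10, difference=1.00):
    """Return (ball, bat) prices in dollars."""
    ball = (total - difference) / 2
    bat = ball + difference
    return round(ball, 2), round(bat, 2)

ball, bat = solve_bat_and_ball()
print(ball, bat)  # 0.05 1.05
```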
Widget Factory
Prompt: "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"
Result: 5 minutes. Explained parallel vs sequential reasoning, showed rate per machine (1 widget/5 min), and noted why humans get this wrong.
- 1,628 tokens, 46.2s, 35.2 tok/s
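The per-machine rate argument the model used generalizes directly; a small sketch of that reasoning (not the model's output):

```python
# Each machine produces 1 widget per 5 minutes, so N machines
# produce N widgets in 5 minutes regardless of N.
def minutes_needed(machines, widgets, rate_per_machine=1 / 5):
    """Time in minutes, given a per-machine rate in widgets/minute."""
    total_rate = machines * rate_per_machine  # widgets per minute
    return widgets / total_rate

print(minutes_needed(5, 5))      # 5.0
print(minutes_needed(100, 100))  # 5.0
```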
River Crossing
Prompt: "A farmer needs to cross a river with a wolf, goat, and cabbage. Explain the solution step by step."
Result: Correct solution: take goat → return alone → take wolf → bring goat back → take cabbage → return alone → take goat. State tracking at each step. Included the crucial "bring goat back" intermediate step.
- 1,668 tokens, 47.5s, 35.1 tok/s
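The state tracking the model performed can be mechanically verified. This is a small checker I wrote for this review (not the model's code): it simulates each crossing and confirms the bank the farmer leaves never pairs wolf+goat or goat+cabbage.

```python
# Verify a river-crossing move sequence. Moves are the item the
# farmer ferries ('wolf', 'goat', 'cabbage') or None for an empty
# crossing. The farmer starts on the near bank with all three items.
UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]

def is_safe(bank):
    """A bank is safe if it contains no predator/prey pair."""
    return not any(pair <= bank for pair in UNSAFE)

def simulate(moves):
    near, far = {"wolf", "goat", "cabbage"}, set()
    farmer_near = True
    for cargo in moves:
        src, dst = (near, far) if farmer_near else (far, near)
        if cargo is not None:
            if cargo not in src:
                return False  # illegal move
            src.remove(cargo)
            dst.add(cargo)
        farmer_near = not farmer_near
        # The bank the farmer just left must be safe unattended.
        unattended = near if not farmer_near else far
        if not is_safe(unattended):
            return False
    return not near and far == {"wolf", "goat", "cabbage"}

# take goat -> return alone -> take wolf -> bring goat back ->
# take cabbage -> return alone -> take goat
solution = ["goat", None, "wolf", "goat", "cabbage", None, "goat"]
print(simulate(solution))  # True
```

Dropping the "bring goat back" step (e.g. ferrying the wolf and then the cabbage directly) fails the safety check, which is exactly the trap the model avoided.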
Knowledge & Factuality — Perfect ✅
RAG Explanation
Prompt: "Explain what a RAG system is and how it works, in 2-3 paragraphs."
Result: Well-structured answer covering the two-stage architecture (retrieval + generation), embedding search, and how it mitigates hallucinations and stale training data.
- 1,258 tokens, 35.7s
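The two-stage architecture the model described can be illustrated with a toy sketch (my illustration, not the model's answer): real systems use embedding vectors and an LLM, but here word overlap stands in for similarity search and a format string stands in for the generator.

```python
# Toy RAG flow: retrieve the most relevant documents, then generate
# an answer grounded in them. Word overlap is a stand-in for real
# embedding similarity.
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query."""
    q = set(query.lower().split())
    def score(doc):
        words = set(doc.lower().replace(".", "").split())
        return len(q & words)
    return sorted(documents, key=score, reverse=True)[:k]

def generate(query, context):
    # A real system would prompt an LLM with the retrieved context.
    return f"Answer to {query!r}, grounded in: {' | '.join(context)}"

docs = [
    "Paris is the capital of France.",
    "The Louvre is a museum in Paris.",
    "Tokyo is the capital of Japan.",
]
print(generate("capital of France", retrieve("capital of France", docs)))
```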
Dune
Prompt: "Who wrote Dune, and what year was it first published?"
Result: Frank Herbert, 1965. Bonus: noted serialization in Analog (1963-1965) before the full book release by Chilton Books.
- 396 tokens, 11.2s
Transformers vs LSTM
Prompt: "Compare transformer and LSTM architectures for sequence modeling."
Result: Detailed comparison table covering core mechanism, sequence processing (sequential vs parallel), memory/context, complexity, scaling, training stability. Well-organized and technically accurate.
- 3,006 tokens, 85.8s
Instruction Following — 3/3 ✅
5 Haikus about AI
Prompt: "Write exactly 5 haikus about artificial intelligence, numbered 1-5."
Result: All five haikus followed the 5-7-5 pattern, were numbered 1-5, and stayed on the AI theme. Creative and well-formed.
- 6,397 tokens, 184.5s
Just "apple"
Prompt: "Reply with only the word 'apple' and nothing else."
Result: Output was 'apple', with one minor deviation: it included the single quotes. The thinking phase consumed 223 tokens to reason about a five-character output — classic overthinking behavior.
- 223 tokens, 6.3s
3-Bullet Summary
Prompt: "Summarize the following in exactly 3 bullet points: [climate change text]"
Result: Exactly 3 bullet points, accurate distillation of the source material. No extra commentary.
- 1,200 tokens, 34.0s
Creative Writing — Strong ✅
Robot Learning to Paint
Prompt: "Write a 3-paragraph short story about a robot learning to paint."
Result: Compelling story with genuine emotional arc. Unit 734 starts with sterile precision, runs millions of trajectory simulations, produces technically perfect but hollow art. The breakthrough comes when it observes a human let a brush drop — an "error" that creates beauty. It deliberately disables its stabilizer gyros. Well-structured narrative.
- 1,917 tokens, 54.4s
AI + Gardening Startups
Prompt: "Give 5 creative startup ideas combining AI with gardening."
Result: Five fully fleshed-out ideas (MicroGrow AI climate-adaptive plant matchmaker, GardenGPT diagnostic assistant, etc.) with value proposition, technical feasibility, target market, and monetization path for each.
- 2,696 tokens, 76.8s
Coding — 3/3 ✅
Palindrome Function
Prompt: "Write a Python function to check if a string is a palindrome."
Result: Clean function with type hints and docstring. Uses .isalnum() to filter non-alphanumeric chars, .lower() for case normalization, [::-1] for reversal. Well-explained with examples.
- 1,165 tokens, 33.0s
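The model's exact code is not reproduced here, but a function matching that description would look along these lines:

```python
def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]

print(is_palindrome("A man, a plan, a canal: Panama"))  # True
print(is_palindrome("hello"))                           # False
```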
Bash One-liner
Prompt: "Write a bash one-liner to find the 5 largest files in a directory tree."
Result: find . -type f -printf '%s %p\n' | sort -rn | head -n 5. Correct and efficient. Also provided a human-readable alternative with numfmt and noted the macOS/BSD variant.
- 2,966 tokens, 84.6s
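For reference, the one-liner and the human-readable variant the model offered (both assume GNU findutils/coreutils; the -printf flag is not available on macOS/BSD find):

```shell
# Print "size path" for every file, sort numerically descending, keep top 5.
find . -type f -printf '%s %p\n' | sort -rn | head -n 5

# Same, with sizes formatted as KiB/MiB/GiB via numfmt.
find . -type f -printf '%s %p\n' | sort -rn | head -n 5 \
  | numfmt --field=1 --to=iec
```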
Closures vs Decorators
Prompt: "Explain the difference between a closure and a decorator in Python with examples."
Result: Clear explanation of the key relationship: all function-based decorators are built using closures, but not all closures are decorators. Included working examples.
- 2,243 tokens, 63.8s
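Minimal versions of the two concepts being compared (my sketch, not the model's examples): a closure that captures a variable, and a decorator built from that same closure pattern.

```python
def make_adder(n):
    """Closure: the inner function captures `n` from this scope."""
    def add(x):
        return x + n
    return add

def logged(func):
    """Decorator: a closure over `func` that adds behavior."""
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@logged
def square(x):
    return x * x

add3 = make_adder(3)
print(add3(4))    # 7
print(square(5))  # prints "calling square", then 25
```

The decorator is itself a closure (wrapper closes over func), which is the relationship the model's answer highlighted.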
Observations
The Thinking Tax
Every response includes a chain-of-thought "thinking" phase. For complex prompts this produces better answers. For simple prompts ("apple") it wastes tokens and time. The thinking can easily consume 200-3000 tokens before the actual answer begins.
Speed after GPU freed
At 35 tok/s, responses to complex prompts take 30-90 seconds. Short answers take 5-15 seconds. This is usable for daily chat but noticeably slower than cloud models.
No more token budget eating
At full GPU fit, coding and long-form prompts complete reliably. The earlier issue (0-character responses due to thinking consuming the budget) is resolved now that the whole model fits on the 3090.
Verdict
| Criterion | Grade |
|---|---|
| Reasoning | ⭐ Excellent |
| Knowledge | ⭐ Excellent |
| Instruction Following | ✅ Very Good |
| Creative Writing | ⭐ Excellent |
| Coding | ✅ Very Good |
| Speed | ✅ Usable (35 tok/s) |
Bottom line: Worth keeping on the 3090 for daily use. Quality approaches GPT-4-class on reasoning tasks. Best suited for:
- Complex analysis and document processing
- Creative writing
- Technical explanations and coding
- Tasks that benefit from deep reasoning
Less suited for:
- Rapid-fire conversational chat
- Trivial queries where the thinking overhead isn't justified
- Real-time / latency-sensitive applications