What I Learned Trying to Make Local LLMs Write Applications
From simple prompts to a full engineering environment: 9B, 35B, agents, scaffolds, and context as currency.
I’ve been testing a simple idea for a while: can a local LLM be genuinely useful for developing personal projects? Not in the one-shot chat format where the model writes a function or explains an error, but in a more interesting mode: I have an app idea, the model helps break it down into steps, writes code, checks itself, fixes errors, and gradually brings the project to a usable state.
First I tried to understand the models themselves. Then generators. Then agent wrappers. Then my own workflows with packages, slices, reviews, sub-agents, and semi-automatic development. At some point it became clear that I was no longer testing one model, but the entire small engineering ecosystem around it.
A short map of experiments
| Stage | What I tried | What became clear |
|---|---|---|
| Synthetic tasks | Texts, sites, constraints, prompts with ChatGPT | Models easily fill in gaps and sound confident where they lack grounding |
| Generators & scaffolds | Quick Vue/Express starts, configs, templates, first drafts | Without a framework, the model quickly drifts into random decisions |
| 9B + agents | Qwen 9B, OpenCode, Pi.dev, Aider, first planner/implementer/reviewer roles | Small tasks work, long autonomy quickly breaks |
| Sub-agents & search | Websearch, codesearch, explore/review sub-tasks | Sub-agents help, but easily bloat context |
| Full-auto workflow | bootstrap, choose-task, implement, stabilize, finish, repair | Full automation is fragile, but stabilize turned out very useful |
| Scaffolds as lab | 10-ideal, Vue 3, Express 5, TS, ESLint, Prettier, API layer | A good starting structure matters more than long instructions |
| 35B local | Qwen3.6 35B A3B, long context, llama.cpp/Vulkan | For the first time, I got the feeling of a working local executor |
| Current approach | Templates, small slices, checks, review packets, external reviewer by need | The model works best inside a prepared corridor |
First came synthetic experiments
It started quite peacefully. I ran synthetic tasks, often with ChatGPT. We came up with prompts and checked how the model sticks to constraints, writes text, and reacts to a task when it has no ready material to draw on.
For example, there was an attempt at a personal website for a hobbyist developer. The conditions were deliberately unpleasant: no fake repositories, articles, projects, clients, or achievements, yet the site still had to look good. I wanted to understand whether the model could make something convincing without a fake biography.
There, the first habit of models surfaced quickly: they fill in gaps. Ask one not to use a portfolio, and it still reaches for familiar blocks: projects, cases, articles, achievements. Ask for living text, and you get smooth but plastic prose. You have to strip out the pathos, empty generalizations, bureaucratic language, and neural-network drama.
Already at this level it became clear that a good result rarely comes from one magical prompt. You need small rules that bring the model back to reality.
The first rules were almost literary:
• don't make up facts
• don't hide emptiness behind beautiful words
• don't pretend the author already has a portfolio
• write simpler
• keep only what you can actually answer for
Later it turned out that exactly the same principle works in code. Only instead of beautiful phrases, the model starts making up architecture, rules, folders, configs, and meaning where it would be better to take a small, clear step.
Then came generators and first scaffolds
After text, I wanted more: to quickly stand up an app foundation. Not just ask the model to write a component, but give it a task like: create a minimal Vue + Express project, set up TypeScript, ESLint, Prettier, Tailwind, add a couple of commands, make it all runnable.
And that’s when the real development part of the experiment began.
The model can create package.json, configs, folders, and a minimal frontend and backend impressively quickly. On a small demo, this looks almost magical. But ask for a fresh toolchain, and the model’s confidence starts to get in the way. It remembers old configs, mixes up new versions, and sometimes fixes a problem in a way that creates a new one two steps later.
There was a very simple example. I asked it to set up the latest Vue with the latest ESLint, with only package.json, eslint.config.mjs, tsconfig.json, and .gitignore. No app. Everything on npm. At the end, npm run lint should run. It seems like a small task, but the agent immediately went to websearch for versions and configs.
And there’s no simple answer here. Search is really needed, because frontend tools change fast. But uncontrolled search quickly turns into a context devourer. For a local model, context is expensive. You have to guard it almost like RAM.
| Idea | Expectation | What happened in practice |
|---|---|---|
| Give the model a fresh frontend setup | It quickly assembles current configs | The model either drags old knowledge or starts searching a lot |
| Ask for only a few files | Context stays small | Even a small task can lead to websearch |
| Use a generator as a start | You’ll get a fast clean project | Without rules nearby, strange solutions appear |
| Make the perfect scaffold immediately | Set it up once, then live happily | The scaffold itself became a separate lab |
At this point I first started thinking not just about the model, but about the shape of the task. The less the model has to invent around the task, the better. The clearer the boundaries, the less garbage in the result.
9B: quick hopes and a narrow corridor
Next I experimented a lot with Qwen 9B. It was tempting: lighter, faster, easier to fit in memory, doesn’t create the feeling that every launch turns your computer into separate infrastructure. For early agent experiments, this size looked reasonable.
9B can be useful. If you give it a small, clear chunk, it can fix a file, write a simple component, work through an obvious error, or put together a small batch of changes. But as soon as a task requires holding architecture, fresh configs, a long chain of actions, or multiple layers of a project, things start to break down.
It can produce working code and put a strange structure right next to it. It can fix one thing and break another. It can confidently edit a config that it doesn’t fully understand.
| Model / mode | How it felt | Where it was useful | Where problems quickly started |
|---|---|---|---|
| Qwen 9B | Light, convenient for quick attempts | Small fixes, simple components, experiments | Architecture, long tasks, fresh configs |
| Qwen 9B as implementer | Fine, if you give a short packet | Local implementation of a small slice | Starts losing edges if the packet is too wide |
| Qwen 9B + remote planner/reviewer | Already looks like a workflow | Plan from outside, small implementation locally | Context transfer and repair packets require discipline |
| Qwen 9B + full-auto idea | Tempting on paper | Educational and semi-automatic runs | Full autonomy quickly becomes fragile |
So the next step wasn’t new prompts, but wrappers. I wanted the model to work not as one chat, but as a participant in a process.
OpenCode, Pi.dev, Aider, and the idea of roles
I tried different tools and schemes: OpenCode, Pi.dev, Aider, Codex, local and remote models, different roles. Gradually a scheme emerged that separated responsibilities between roles: planner, implementer, reviewer, and repairer.
At first this looked cumbersome. Then it became clear that the cumbersomeness had a point. A local model struggles when it is handed a large task all at once. It’s better to pass it a small piece with clear boundaries and a specific check.
In parallel, the idea of a local orchestrator emerged. The main OpenCode chat on local Qwen manages sub-agents: remote slice planner, local implementer, remote reviewer, local repairer. In theory, the main agent holds state, and sub-tasks do the dirty work. Reality was less smooth, but the direction was useful.
My workflows quickly started growing bureaucracy
After ready-made tools, I started making my own templates. Commands appeared like bootstrap-project, choose-task, implement-slice, finish-packet, revert-packet. Then start-packet, work-packet, stabilize-packet, inspect, abort-packet. The idea was clear: each command runs in a new session, gets the minimum necessary context, does one stage, and writes artifacts.
The approximate chain looked like this:
bootstrap-project
↓
answers.md / project docs / current state
↓
choose-task or start-packet
↓
next-build-packet.md
↓
work-packet / implement-slice
↓
checks
↓
stabilize-packet
↓
finish-packet
It looked almost like a small app factory. In practice, the factory quickly started demanding care.
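The chain above can be sketched as a list of stages, where each stage sees only the artifacts it names and writes new ones. This is a minimal sketch under assumptions: the `Stage` type, the `runChain` function, and the artifact names `changes.diff` and `stabilize-report.md` are hypothetical, and the stage bodies stand in for real model calls.

```typescript
// Hypothetical model of the chain: each stage sees only the artifacts it names.
type Artifacts = Record<string, string>;

interface Stage {
  name: string;
  reads: string[]; // artifact names this stage is allowed to see
  run: (input: Artifacts) => Artifacts; // returns the artifacts it wrote
}

function runChain(stages: Stage[], initial: Artifacts): Artifacts {
  let store: Artifacts = { ...initial };
  for (const stage of stages) {
    // Build the minimal context for this stage.
    const input: Artifacts = {};
    for (const key of stage.reads) {
      const value = store[key];
      if (value === undefined) {
        throw new Error(`${stage.name}: missing artifact ${key}`);
      }
      input[key] = value;
    }
    store = { ...store, ...stage.run(input) };
  }
  return store;
}

// Illustrative stages; "changes.diff" and "stabilize-report.md" are made-up names.
const stages: Stage[] = [
  {
    name: "start-packet",
    reads: ["answers.md"],
    run: (a) => ({ "next-build-packet.md": `slice for: ${a["answers.md"]}` }),
  },
  {
    name: "work-packet",
    reads: ["next-build-packet.md"],
    run: (a) => ({ "changes.diff": `diff for: ${a["next-build-packet.md"]}` }),
  },
  {
    name: "stabilize-packet",
    reads: ["changes.diff"],
    run: () => ({ "stabilize-report.md": "checks passed" }),
  },
];

const finalArtifacts = runChain(stages, { "answers.md": "notes app" });
console.log(Object.keys(finalArtifacts));
```

The point of the `reads` list is that a stage cannot quietly pull in the whole history: anything it needs has to be named up front.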
Documents repeated each other. The README turned out to be useless. AGENTS.md grew and got worse. The model confused workflow files with the product. Internal folders like .workflow and .pi had to be explicitly protected so they would not be treated as part of the project. Commands with arguments turned out to be less useful than expected, because bootstrap and inspect ask their own questions anyway. Some restrictions that initially seemed like protection only hindered the semi-automatic experiment.
| What I added to the workflow | Why it seemed like a good idea | What turned out |
|---|---|---|
| Lots of docs at start | The model will understand the project better | Documents start repeating and aging |
| Long AGENTS.md | More rules, fewer errors | Long rules are harder to read and harder to follow |
| Commands with arguments | Can run everything more precisely | In reality, commands often ask for what they need |
| Bootstrap the whole project | The model will have the full picture | The full picture quickly turns into extra context |
| Subagents for checks | The main flow will stay clean | Sub-agents can also burn a lot of context |
| Stabilize pass | Final cleanup | One of the most useful stages of the whole process |
The most useful stage turned out to be stabilize. When the model has already done something, it’s easier for it to look at the specific result, find gross errors, and bring the project into order. Full autonomous development from one idea remained fragile, but the cycle of making a piece, checking it, stabilizing it, and committing it looked much closer to real usefulness.
Sub-agents: a good idea, if they don’t eat the whole house
Separately, I thought a lot about sub-agents. I wanted to relieve the main context. Let a separate agent go to websearch, codesearch, explore, review, gather information, and return a short verdict. Then the main flow won’t be clogged with search, logs, and intermediate garbage.
In theory, it sounds great.
In practice, I saw a sub-agent accumulate about 59 thousand tokens of context, do several searches, and work for about twelve minutes. Then the unpleasant question arose: how much did all this really relieve the main process? Even if the sub-task technically completes separately, its final report easily becomes too long. And if the report is long, the main context bloats again.
Good sub-agent:
• does one sub-task
• reads a limited set of files
• returns a short verdict
• gives paths and specific findings
• doesn't write an essay in the main chat
Bad sub-agent:
• explores everything randomly
• searches a lot
• quotes a lot
• brings a long report
• creates an illusion of progress, but clogs context
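One way to enforce the "short verdict, paths, specific findings" shape is to give the sub-agent report a fixed structure and cap what reaches the main chat. This is a minimal sketch; the `SubAgentReport` type, the verdict values, and the `renderForMainChat` function are hypothetical names, not part of any of the tools mentioned here.

```typescript
interface Finding {
  path: string; // file the finding refers to
  note: string; // one-line description
}

interface SubAgentReport {
  verdict: "ok" | "needs-repair" | "blocked"; // assumed verdict values
  findings: Finding[];
}

// Render the report for the main chat, showing at most `maxFindings` findings.
function renderForMainChat(report: SubAgentReport, maxFindings = 5): string {
  const shown = report.findings.slice(0, maxFindings);
  const lines = shown.map((f) => `- ${f.path}: ${f.note}`);
  const hidden = report.findings.length - shown.length;
  if (hidden > 0) {
    // The rest stays in the report artifact instead of the main context.
    lines.push(`- (${hidden} more findings left in the report file)`);
  }
  return [`verdict: ${report.verdict}`, ...lines].join("\n");
}

const exampleReport: SubAgentReport = {
  verdict: "needs-repair",
  findings: Array.from({ length: 7 }, (_, i) => ({
    path: `src/views/View${i + 1}.vue`,
    note: "unused import",
  })),
};

console.log(renderForMainChat(exampleReport));
```

The cap is deliberately blunt: the full list lives in a file, and the main flow only pays for a handful of lines.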
With websearch and codesearch, the story is similar. I added them to Pi agent, discussed how OpenSpec might use this, thought about whether to allow free search. In the end, free mode turned out to be dangerous. The model loves to search, because searching looks like work. A limit works better: go to the internet a couple of times, when you really need fresh versions, documentation, or real examples. For everything else, the local project context, short task packet, and checks are enough.
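The "couple of searches" limit can be made explicit as a small budget object that a tool wrapper consults before each search. This is a sketch under assumptions: `SearchBudget` is a hypothetical helper, and how it would be wired into a particular agent is not shown.

```typescript
// Hypothetical per-session search budget: a fixed number of searches, each with a reason.
class SearchBudget {
  private readonly limit: number;
  private readonly reasons: string[] = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  // Spends one search if any are left; otherwise refuses.
  tryAcquire(reason: string): boolean {
    if (this.reasons.length >= this.limit) {
      return false;
    }
    this.reasons.push(reason);
    return true;
  }

  // Reasons for the searches that were actually allowed.
  spent(): readonly string[] {
    return this.reasons;
  }
}

const budget = new SearchBudget(2);
const attempts = [
  "latest ESLint flat config format",
  "current Vue version",
  "how to center a div",
].map((reason) => ({ reason, allowed: budget.tryAcquire(reason) }));

console.log(attempts);
```

Requiring a reason for each search also leaves a short trail that shows whether the budget went to fresh versions and documentation or to busywork.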
First live applications and unpleasant details
In parallel with the workflows, I tried real applications. Sometimes the results were already pleasing. For example, there were auto-generated prototypes with dark liquid-glass UI, Active / Pinned / Trash tabs, modal editor, confirm dialog, SQLite sessions, backup/restore scripts, and LAN-friendly dev.
At some point, a local 35B was able to generate a wide working prototype, then fix a long TypeScript problem itself, and call an explore sub-agent. This was an important shift. Before this, agent schemes often looked like a fight with tools. Here, for the first time, there was a feeling that a local model could be an executor for a real personal project.
But the small disasters didn’t go away.
In one editor, the model tried to add attributes to disable autocorrect on a mobile keyboard. The idea was normal: a specific editor went crazy from auto-replacement and started typing word fragments on the right. But the local model mixed up the brackets, and the application crashed with a syntax error in compiled JS. The fix was simple, but the episode itself shows the level of risk: the model can understand the intent, make almost the right edit, and still break the app with one stupid bracket.
Another example was with Sovka. After the redesign to dark liquid glass, the interface became noticeably better, but the model created a second element with id="app" inside AppShell. The root #app was already in index.html, and the duplicate id broke the app. I had to rename the inner shell to app-shell and add a guardrail: don’t create a second #app; the inner container should be called #app-shell or use a class.
| Incident | What the model tried to do | How it ended | What guardrail appeared |
|---|---|---|---|
| Mobile autocorrect in editor | Add attributes to editor root | Broke brackets and JS build | Small UI edits also need verification through build |
| Second id="app" in Sovka | Add inner shell | Broke the app due to root id conflict | #app only in index.html, shell is called #app-shell |
| Websearch for small setup | Find fresh versions | Context started bloating | Search only when needed and with a limit |
| Large docs in workflow | Help the model remember the project | Model confused workflow and product | Separate internal folders explicitly |
| External review without wiring context | Find error in diff | Reviewer could be wrong due to incomplete packet | Review packet must include wiring context |
These things are hard to catch with a beautiful system prompt. They’re better turned into short rules that grew from real disasters.
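A rule like "#app only in index.html" can also be turned into a mechanical check, so it does not depend on the model rereading AGENTS.md. This is a minimal sketch that works on file contents passed in as strings; the function name `findForbiddenIdUses` and the example file contents are hypothetical, and a real check would read the files from disk.

```typescript
// Returns the files, other than the allowed one, that declare the given id attribute.
function findForbiddenIdUses(
  files: Record<string, string>, // path -> file content
  id: string,
  allowedFile: string,
): string[] {
  // Escape regex metacharacters so the id is matched literally.
  const escaped = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Match a plain id attribute: id="app" or id='app', not data-id or :id.
  const pattern = new RegExp(`(?:^|\\s)id\\s*=\\s*["']${escaped}["']`);
  return Object.entries(files)
    .filter(([path, text]) => path !== allowedFile && pattern.test(text))
    .map(([path]) => path);
}

// Made-up file contents for illustration.
const projectFiles: Record<string, string> = {
  "index.html": '<div id="app"></div>',
  "src/layouts/AppShell.vue": '<template><div id="app" class="shell"></div></template>',
  "src/layouts/Fixed.vue": '<template><div id="app-shell"></div></template>',
};

console.log(findForbiddenIdUses(projectFiles, "app", "index.html"));
```

Run as part of the normal checks, a script like this catches the duplicate root before the browser does.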
10-ideal: scaffold as a laboratory
An important stage was 10-ideal. I didn’t see it as the final notes app. It was a laboratory for Qwen, Pi, and OpenCode: pnpm workspace, Vue 3 + Vite + TypeScript on the frontend, Express 5 + TypeScript on the backend, strict TypeScript, ESLint, Prettier, Tailwind v4, Vue Router, API client layer, later ideas of Pinia, Zod, OpenAPI, SQLite, auth, tests, and CI.
An important insight gradually emerged there: a local model needs a good starting environment. Not a huge architectural encyclopedia, but a clean scaffold with clear folders, check commands, and a few short rules.
I started leaning more toward type-first structure, not feature-first. For the frontend, something like:
src/api
src/stores
src/router
src/layouts
src/views
src/components/ui
src/components/<domain>
src/styles
For the backend:
src/routes
src/services
src/repositories
src/validation
src/middleware
src/config
src/db
src/types
The model works better when it doesn’t have to reinvent the project shape every time. In an empty folder, it fantasizes. In a good scaffold, it more often continues an existing thought.
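If the scaffold defines the project shape, a small check can flag files that land outside it. This is a sketch under assumptions: the folder list is taken from the frontend layout above, while the allowed root files and the function name `pathsOutsideScaffold` are hypothetical.

```typescript
// Folders from the frontend layout; src/components covers both ui and domain folders.
const allowedDirs = [
  "src/api",
  "src/stores",
  "src/router",
  "src/layouts",
  "src/views",
  "src/components",
  "src/styles",
];

// Assumed entry files that may live directly in src.
const allowedRootFiles = ["src/main.ts", "src/App.vue"];

// Returns the paths that do not belong to any known scaffold folder.
function pathsOutsideScaffold(paths: string[]): string[] {
  return paths.filter((p) => {
    if (allowedRootFiles.includes(p)) {
      return false;
    }
    return !allowedDirs.some((dir) => p.startsWith(`${dir}/`));
  });
}

const touchedFiles = [
  "src/api/notes.ts",
  "src/components/ui/Button.vue",
  "src/helpers/misc.ts",
  "src/App.vue",
];

console.log(pathsOutsideScaffold(touchedFiles));
```

A flagged path is not automatically wrong, but it is a cheap signal that the model has started inventing structure again.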
This approach started to produce real things. One example is a lightweight analog of pi-web for LAN/self-use: project folder management, creating new folders, a list of Pi sessions through the SDK, a reader view, SSE events, session creation, a guard against a second prompt in a running session, and running/idle status through finally. There was still a lot of work to do, but the feeling was different. The model was no longer sticking random Vue components together. It was working inside a form that could be continued.
Checks, reviews, and an unexpected lesson about external models
I had an interesting episode with review-collect and external review. The local model passed a task on the notes scaffold, and an external reviewer erroneously marked a blocker on POST /api/notes, because it didn’t see express.json() in the wiring context. After that, it became clear that diff-audit is useful, but the review packet must include not only the diff, but also important pieces of wiring and untracked files.
Interestingly, the local Qwen in that case gave a more useful review than the external model. This doesn’t mean external models are useless. Rather, they depend very much on the packet they were given. If the packet is incomplete, the reviewer starts confidently complaining about things that don’t actually exist in the project.
| Review format | What’s good | Where the risk is |
|---|---|---|
| Local review by Qwen on project | Sees more of the current context, can be practical | Can be too soft on its own solutions |
| External reviewer on diff | Fresh perspective, cheap critic, good for repair packet | Makes mistakes if the packet doesn’t contain wiring context |
| Full project dump | Lots of information | Expensive in context and often noisy |
| Deterministic review packet | Managed volume, easy to pass further | Need to know in advance which files matter |
After that, I started valuing not the model-reviewer itself, but the quality of the packet. A good packet is sometimes more important than the difference between models.
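A deterministic review packet can be sketched as a function that always emits the same three sections in the same order: the diff, the wiring context, and the untracked files. The `ReviewInput` type, the `buildReviewPacket` function, the section titles, and the example contents are hypothetical; which wiring files to include still has to be decided by hand.

```typescript
interface ReviewInput {
  diff: string;
  wiring: Record<string, string>; // path -> excerpt showing how things are connected
  untracked: string[]; // new files that a plain diff would not show
}

// Sort by plain string comparison so the output does not depend on locale.
function byName(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function buildReviewPacket(input: ReviewInput): string {
  const wiringSections = Object.entries(input.wiring)
    .sort(([a], [b]) => byName(a, b))
    .map(([path, text]) => `File: ${path}\n${text.trim()}`);
  const untrackedLines = [...input.untracked].sort(byName).map((p) => `- ${p}`);
  return [
    "DIFF",
    input.diff.trim(),
    "WIRING CONTEXT",
    ...wiringSections,
    "UNTRACKED FILES",
    untrackedLines.join("\n"),
  ].join("\n\n");
}

// Made-up example: the wiring excerpt is what lets a reviewer see express.json().
const packet = buildReviewPacket({
  diff: "+ router.post('/api/notes', createNote)",
  wiring: { "src/app.ts": "app.use(express.json());\napp.use(router);" },
  untracked: ["src/routes/notes.ts"],
});

console.log(packet);
```

With the wiring excerpt in the packet, a reviewer looking at POST /api/notes can see that body parsing is already in place instead of guessing.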
Transition to 35B
At some point it became clear that 9B was too tight for the main role. I tried different options: Qwen Coder 30B, Qwen 27B, Qwen2.5 Coder 14B, different quantizations. Some ran, but were too slow for a normal cycle. Qwen3 Coder 30B A3B Q3_K_L technically worked, but one audit-agent took about 700 seconds. Qwen3.5 27B took tens of minutes on a full run. The model could be smarter than 9B, but a real workflow is killed not only by stupidity, but also by latency.
A real shift happened with Qwen3.6 35B A3B in GGUF, especially with UD-Q4_K_M. On my setup with a Ryzen 7 5700X, 32 GB RAM, and a GPU with 16 GB VRAM, it became the first large local model that could be seriously integrated into development.
In a good configuration, I saw about 22–25 tokens per second on generation and about 280–318 tokens per second on prompt processing. Sometimes long prompt eval was lower, but still tolerable. At the same time, memory immediately became part of the workflow: VRAM usage can sit at about 14 of 16 GB and RAM at about 22 of 32 GB, so every extra browser, VS Code window, Docker instance, or second session is felt.
| Model | How it felt in my experiments | Where it was useful | What held back |
|---|---|---|---|
| Qwen 9B | Fast and convenient for small tasks | Simple fixes, short implementations, early agents | Narrow corridor, weaker at holding architecture |
| Qwen3 Coder 30B A3B Q3_K_L | Technically launched | Checking the idea of a large coder model | Audit-agent took about 700 seconds, too slow |
| Qwen3.5 27B | Smarter than small models, but heavy | Experiments with long runs | Tens of minutes on a full run |
| Qwen2.5 Coder 14B | Tried as an intermediate option | Separate coding tasks | Speed and quality didn’t give the needed balance |
| Qwen3.6 35B A3B UD-Q4_K_M | The first option that feels like a working local executor | Agent coding, review, feature slices, scaffold-development | RAM, VRAM, long context, generation cost |
With 35B, the ceiling changed. The model holds the task better, understands architectural notes better, works more calmly with a large project, and more confidently fixes its own errors. But old problems didn’t disappear. Long context is expensive. Full reprocessing feels painful. If the agent starts reading a lot, searching a lot, and writing a lot in chat, speed drops, and the session becomes heavy.
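The cost of full reprocessing is easy to estimate from the speeds above. The sketch below is back-of-the-envelope arithmetic, not a benchmark: it assumes the whole prompt is processed with no cache hit, that speed stays constant over the context, and that the reply is 1,000 tokens long. The reply length and the function name `estimateTurnSeconds` are assumptions.

```typescript
interface Speeds {
  promptTokPerSec: number; // prompt processing speed
  genTokPerSec: number; // generation speed
}

// Time for one turn: process the whole prompt, then generate the reply.
function estimateTurnSeconds(
  promptTokens: number,
  outputTokens: number,
  speeds: Speeds,
): number {
  return promptTokens / speeds.promptTokPerSec + outputTokens / speeds.genTokPerSec;
}

// Lower and upper ends of the measured ranges.
const slowEnd: Speeds = { promptTokPerSec: 280, genTokPerSec: 22 };
const fastEnd: Speeds = { promptTokPerSec: 318, genTokPerSec: 25 };

const promptTokens = 65_000; // the 65k context
const outputTokens = 1_000; // assumed reply length

const slowSeconds = estimateTurnSeconds(promptTokens, outputTokens, slowEnd);
const fastSeconds = estimateTurnSeconds(promptTokens, outputTokens, fastEnd);

console.log(`slow end: ${Math.round(slowSeconds)} s, fast end: ${Math.round(fastSeconds)} s`);
```

With these inputs the estimate lands between roughly 244 and 278 seconds, a bit over four minutes for a single uncached pass over a 65k prompt. Since long prompt eval was sometimes slower than these figures, this is closer to a best case, which is why prompt caching and short packets matter so much.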
I thought I’d be choosing models. Instead, I started discussing batch, ubatch, KV cache, prompt cache, context at 65k or 81k, how much RAM to squeeze from the browser, how not to kill the system, what to do with Docker Desktop, how much to leave for VS Code. Somewhere at this stage, local LLM development stopped being only about prompts.
Context turned out to be a separate currency
Long context was a fundamental topic for me. Without it, agent development quickly falls apart: the model forgets documents, loses the current packet, doesn’t see the architecture, repeats old solutions. But you can’t use large context as a junk drawer.
The more I experimented, the more I wanted to push details into files and artifacts, and give the main flow a short packet. If the model writes a huge report in chat, it harms the next step. If the reviewer makes a short repair packet, it can be passed to the local executor without extra noise.
What started working better:
• long details live in files
• the main chat gets a short packet
• review returns findings, paths, verdict
• implementer doesn't write a long report
• checks run after each step
• formatting is done on specific files
• the next step doesn't read the whole history again
Context, memory, and speed gradually became part of one system. On paper, you can say: just get a smarter model. In practice, you have to ask differently: how much does it eat, how much does it generate, how much does it hold, how often can you run checks, how painful is it to restart a session, can you simultaneously hold a browser and VS Code.
Where remote models are still useful
After the transition to 35B, remote models didn’t disappear. Rather, their role became calmer. Free Laguna and other OpenRouter models are useful as an external planner, critic, or reviewer. Especially when you need to look at a solution from the outside.
But giving them the whole project every time is inconvenient. A one-time external review works better: collect a review packet, give it to the model, get specific notes, return a short repair packet to the local executor.
In the end, the external reviewer became not the main brain, but an extra pair of eyes. The main work still comes down to the shape of the task.
What changed in expectations
If you look back, expectations changed several times.
At first, it seemed that the right prompt was needed. Then it seemed that the right agent was needed. Then the right workflow. Then it became clear that you need an environment: templates, checks, small packets, clear commands, short rules, a normal way to transfer state between sessions.
My current rules for local LLM development
What currently looks most workable:
• a small slice instead of a large task
• a good scaffold instead of an empty folder
• a short AGENTS.md instead of an encyclopedia
• checks after each step
• stabilize pass before the next stage
• websearch only when needed
• review packet instead of a huge dump
• formatting only the touched files
• details in artifacts, not in chat
• one heavy local executor at a time
Another conclusion: you need different templates. One scaffold for a Vue + Express app. A separate one for a semi-static personal site. Another for a CLI. Possibly another for a small backend-only service. Each should have working check commands, a clear structure, minimal docs, and a few rules born from real breakages.
Where local models are already useful
| Task | 9B | 35B |
|---|---|---|
| Small fix | Good | Good |
| Simple feature | Tolerable | Good |
| Component in a ready scaffold | Fine | Good |
| Architectural plan | Weak | Fine |
| Reviewing your own code | Limited | Useful |
| Working with a large context project | Quickly becomes tight | Real, but expensive |
| Long autonomous development | Bad | Still risky |
| Stabilize pass | Sometimes useful | Very useful |
| Working without scaffold | Often chaos | Better, but still risky |
| Working in a good scaffold | Fine | Very useful |
In the end
The most pleasant thing about all this experience: progress is really there. Early experiments often looked like a fight with beautiful chatter. Then came the fight with configs, memory, agents, permission prompts, bloated documents, and sessions with tens of thousands of tokens. Now a local 35B can already make a piece of a real application on a good scaffold in a few hours. With bugs, with review, with human oversight, but it’s already working material.
The main skill turned out not to be writing the perfect prompt. It’s much more important to learn to build an environment around the model where it’s hard to do stupid things and easy to do useful work.
A local LLM in this format is like a worker in a personal workshop. It needs tools, a workbench, clear templates, a short task, and result verification. If you throw a whole idea at it, it will easily make a mess. If you give it a good scaffold, a small slice, and a normal set of checks, it starts being useful.
For me, that became the main conclusion. A local model can already be part of development. But value appears not from one model, but from everything around it: templates, packets, reviews, checks, limited search, careful context, and an honest attitude toward its weak spots.