What I Learned Trying to Make Local LLMs Write Applications

I’ve been testing a simple idea for a while: can a local LLM be genuinely useful for developing personal projects? Not in the one-shot chat format where the model writes a function or explains an error, but in a more interesting mode: I have an app idea, the model helps break it down into steps, writes code, checks itself, fixes errors, and gradually brings the project to a usable state.

First I tried to understand the models themselves. Then generators. Then agent wrappers. Then my own workflows with packages, slices, reviews, sub-agents, and semi-automatic development. At some point it became clear that I was no longer testing one model, but the entire small engineering ecosystem around it.

A short map of experiments

Stage	What I tried	What became clear
Synthetic tasks	Texts, sites, constraints, prompts with ChatGPT	Models easily fill in gaps and sound confident where they lack grounding
Generators & scaffolds	Quick Vue/Express starts, configs, templates, first drafts	Without a framework, the model quickly drifts into random decisions
9B + agents	Qwen 9B, OpenCode, Pi.dev, Aider, first planner/implementer/reviewer roles	Small tasks work, long autonomy quickly breaks
Sub-agents & search	Websearch, codesearch, explore/review sub-tasks	Sub-agents help, but easily bloat context
Full-auto workflow	bootstrap, choose-task, implement, stabilize, finish, repair	Full automation is fragile, but stabilize turned out very useful
Scaffolds as lab	10-ideal, Vue 3, Express 5, TS, ESLint, Prettier, API layer	A good starting structure matters more than long instructions
35B local	Qwen3.6 35B A3B, long context, llama.cpp/Vulkan	For the first time, I got the feeling of a working local executor
Current approach	Templates, small slices, checks, review packets, external reviewer by need	The model works best inside a prepared corridor

First came synthetic experiments

It started quite peacefully. I ran synthetic tasks, often with ChatGPT. We came up with prompts and checked how well the model held constraints, wrote text, and reacted to a task with no ready material.

For example, there was an attempt at a personal website for a hobbyist developer. The conditions were deliberately unpleasant: no fake repositories, articles, projects, clients, or achievements, and yet the site still had to look good. I wanted to understand whether the model could make something convincing without a fake biography.

There, the first habit of these models surfaced quickly: they fill in gaps. Ask one not to use a portfolio, and it still reaches for familiar blocks: projects, cases, articles, achievements. Ask for living text, and you get smooth but plastic delivery. You have to strip out the pathos, empty generalizations, bureaucratic language, and neural-network drama.

Already at this level it became clear that a good result rarely comes from one magical prompt. You need small rules that bring the model back to reality.

The first rules were almost literary:

• don't make up facts
• don't hide emptiness behind beautiful words
• don't pretend the author already has a portfolio
• write simpler
• keep only what you can actually answer for

Later it turned out that exactly the same principle works in code. Only instead of beautiful phrases, the model starts making up architecture, rules, folders, configs, and meaning where it would be better to take a small, clear step.

Then came generators and first scaffolds

After text, I wanted more: to stand up an app foundation quickly. Not just ask the model to write a component, but give it a task like: create a minimal Vue + Express project, set up TypeScript, ESLint, Prettier, Tailwind, add a couple of commands, make it all runnable.

And that’s when the real development part of the experiment began.

The model can impressively quickly create package.json, configs, folders, minimal frontend and backend. On a small demo, this looks almost magical. But ask for a fresh toolchain, and the model’s confidence starts to get in the way. It remembers old configs, mixes up new versions, sometimes fixes a problem in a way that creates a new one two steps later.

There was a very simple example. I asked the agent to set up the latest Vue with the latest ESLint, producing only package.json, eslint.config.mjs, tsconfig.json, and .gitignore. No app. All on npm. At the end, npm run lint should run. Seems like a small task. But the agent immediately went into websearch for versions and configs.

And there’s no simple answer here. Search is really needed, because frontend tools change fast. But uncontrolled search quickly turns into a context devourer. For a local model, context is expensive. You have to guard it almost like RAM.

Idea	Expectation	What happened in practice
Give the model a fresh frontend setup	It quickly assembles current configs	The model either drags old knowledge or starts searching a lot
Ask for only a few files	Context stays small	Even a small task can lead to websearch
Use a generator as a start	You’ll get a fast clean project	Without rules nearby, strange solutions appear
Make the perfect scaffold immediately	Set it up once, then live happily	The scaffold itself became a separate lab

At this point I first started thinking not just about the model, but about the shape of the task. The less the model has to invent around the task, the better. The clearer the boundaries, the less garbage in the result.

9B: quick hopes and a narrow corridor

Next I experimented a lot with Qwen 9B. It was tempting: lighter, faster, easier to fit in memory, doesn’t create the feeling that every launch turns your computer into separate infrastructure. For early agent experiments, this size looked reasonable.

9B can be useful. If you give it a small, clear chunk, it can fix a file, write a simple component, work through an obvious error, or put together a small batch of changes. But as soon as a task requires holding architecture, fresh configs, a long chain of actions, or multiple layers of a project, things start to break down.

It can produce working code, and next to it lay down a strange structure. It can fix one thing and break another. It can confidently edit a config that it doesn’t fully understand itself.

Model / mode	How it felt	Where it was useful	Where problems quickly started
Qwen 9B	Light, convenient for quick attempts	Small fixes, simple components, experiments	Architecture, long tasks, fresh configs
Qwen 9B as implementer	Fine, if you give a short packet	Local implementation of a small slice	Starts losing edges if the packet is too wide
Qwen 9B + remote planner/reviewer	Already looks like a workflow	Plan from outside, small implementation locally	Context transfer and repair packets require discipline
Qwen 9B + full-auto idea	Tempting on paper	Educational and semi-automatic runs	Full autonomy quickly becomes fragile

So the next step wasn’t new prompts, but wrappers. I wanted the model to work not as one chat, but as a participant in a process.

OpenCode, Pi.dev, Aider, and the idea of roles

I tried different tools and schemes: OpenCode, Pi.dev, Aider, Codex, local and remote models, different roles. Gradually a scheme emerged with separation of responsibilities:

Workflow pipeline: planner → implementer → reviewer → repairer

At first this looked cumbersome. Then it became clear that the cumbersomeness has a point. A local model struggles when it is handed a large task all at once. It’s better to pass it a small piece with clear boundaries and a specific check.

In parallel, the idea of a local orchestrator emerged. The main OpenCode chat on local Qwen manages sub-agents: remote slice planner, local implementer, remote reviewer, local repairer. In theory, the main agent holds state, and sub-tasks do the dirty work. Reality was less smooth, but the direction was useful.
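
As a rough sketch, the role split could be expressed like this in TypeScript. Everything here is hypothetical: the type names, the `runSlice` function, and the repair limit are illustrations, not the API of OpenCode, Pi, or Aider. The roles are plain synchronous functions for brevity; real model calls would be asynchronous.

```typescript
// Hypothetical role pipeline. None of these names come from a real agent tool.
type SlicePacket = { task: string; files: string[]; check: string };
type ReviewFinding = { path: string; note: string };

interface SliceRoles {
  plan(idea: string): SlicePacket;
  implement(packet: SlicePacket): string; // returns a diff as text
  review(diff: string, packet: SlicePacket): ReviewFinding[];
  repair(diff: string, findings: ReviewFinding[]): string;
}

// Runs one slice: plan, implement, then review and repair
// until the review is clean or the repair attempts run out.
function runSlice(roles: SliceRoles, idea: string, maxRepairs = 2) {
  const packet = roles.plan(idea);
  let diff = roles.implement(packet);
  let findings = roles.review(diff, packet);
  let repairs = 0;
  while (findings.length > 0 && repairs < maxRepairs) {
    diff = roles.repair(diff, findings);
    findings = roles.review(diff, packet);
    repairs += 1;
  }
  return { packet, diff, openFindings: findings, repairs };
}
```

The repair limit is the important part of the sketch: when findings remain after the last attempt, they are returned to the human instead of being looped over again.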

My workflows quickly started growing bureaucracy

After ready tools, I started making my own templates. Commands appeared like bootstrap-project, choose-task, implement-slice, finish-packet, revert-packet. Then start-packet, work-packet, stabilize-packet, inspect, abort-packet. The idea was clear: each command runs in a new session, gets the minimum necessary context, does one stage, and writes artifacts.

The approximate chain looked like this:

bootstrap-project
  ↓
answers.md / project docs / current state
  ↓
choose-task or start-packet
  ↓
next-build-packet.md
  ↓
work-packet / implement-slice
  ↓
checks
  ↓
stabilize-packet
  ↓
finish-packet

It looked almost like a small app factory. In practice, the factory quickly started demanding care.
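
One way to picture the chain is as a list of stages, each reading a few artifacts and writing one. The sketch below is only an illustration: `answers.md` and `next-build-packet.md` come from the chain above, while the remaining artifact names and the `nextCommand` helper are hypothetical.

```typescript
// Each stage reads a few artifacts and writes exactly one.
// Only answers.md and next-build-packet.md appear in the chain above;
// the other artifact names are hypothetical placeholders.
type StageSpec = { command: string; reads: string[]; writes: string };

const packetStages: StageSpec[] = [
  { command: "bootstrap-project", reads: [], writes: "answers.md" },
  { command: "start-packet", reads: ["answers.md"], writes: "next-build-packet.md" },
  { command: "work-packet", reads: ["next-build-packet.md"], writes: "work-notes.md" },
  { command: "stabilize-packet", reads: ["work-notes.md"], writes: "stabilize-notes.md" },
  { command: "finish-packet", reads: ["stabilize-notes.md"], writes: "finish-notes.md" },
];

// Returns the first stage whose output is missing and whose inputs all exist,
// or null when every stage has already produced its artifact.
function nextCommand(existing: string[], stages: StageSpec[] = packetStages): string | null {
  for (const stage of stages) {
    const done = existing.indexOf(stage.writes) !== -1;
    const ready = stage.reads.every((artifact) => existing.indexOf(artifact) !== -1);
    if (!done && ready) return stage.command;
  }
  return null;
}
```

Because the next step is derived from files on disk, a fresh session can pick up the work without rereading the chat history.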

Documents repeated each other. README turned out to be useless. AGENTS.md grew and became worse. The model confused workflow files with the product. Internal folders like .workflow and .pi had to be separately protected from being perceived as part of the project. Commands with arguments turned out to be less useful than expected, because bootstrap and inspect still ask questions themselves. Some restrictions that initially seemed like protection only hindered the semi-automatic experiment.
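
A small filter is one possible way to keep the internal folders out of anything that is presented to the model as product code. The folder names are the ones mentioned above; the helper functions are a hypothetical sketch.

```typescript
// Folder names come from the text; the helpers are a hypothetical sketch.
const internalDirs = [".workflow", ".pi"];

// True when any path segment is exactly one of the internal folder names.
function isInternalPath(filePath: string): boolean {
  const segments = filePath.replace(/\\/g, "/").split("/");
  return segments.some((segment) => internalDirs.indexOf(segment) !== -1);
}

// Keeps only the files that belong to the product itself.
function productFilesOnly(paths: string[]): string[] {
  return paths.filter((p) => !isInternalPath(p));
}
```

Matching whole path segments, rather than substrings, avoids hiding an unrelated file whose name merely starts with `.pi`.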

What I added to the workflow	Why it seemed like a good idea	What turned out
Lots of docs at start	The model will understand the project better	Documents start repeating and aging
Long AGENTS.md	More rules, fewer errors	Long rules are harder to read and harder to follow
Commands with arguments	Can run everything more precisely	In reality, commands often ask for what they need
Bootstrap the whole project	The model will have the full picture	The full picture quickly turns into extra context
Sub-agents for checks	The main flow will stay clean	Sub-agents can also burn a lot of context
Stabilize pass	Final cleanup	One of the most useful stages of the whole process

The most useful stage turned out to be stabilize. When the model has already done something, it’s easier for it to look at the specific result, find gross errors, and bring the project into order. Full autonomous development from one idea remained fragile, but a cycle of making a piece, checking it, stabilizing it, and committing it looked much closer to real usefulness.

Sub-agents: a good idea, if they don’t eat the whole house

Separately, I thought a lot about sub-agents. I wanted to relieve the main context. Let a separate agent go to websearch, codesearch, explore, review, gather information, and return a short verdict. Then the main flow won’t be clogged with search, logs, and intermediate garbage.

In theory, it sounds great.

In practice, I saw how a sub-agent accumulated about 59 thousand tokens of context, did several searches, worked for about twelve minutes, and then the unpleasant question arose: how much did all this really relieve the main process? Even if technically the sub-task completes separately, its final report easily becomes too long. And if the report is long, the main context puffs up again.

Good sub-agent:
• does one sub-task
• reads a limited set of files
• returns a short verdict
• gives paths and specific findings
• doesn't write an essay in the main chat

Bad sub-agent:
• explores everything randomly
• searches a lot
• quotes a lot
• brings a long report
• creates an illusion of progress, but clogs context
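
The "short verdict" rule can be enforced mechanically rather than by asking politely. The following is a minimal sketch under assumptions: the report shape, the function name, and both limits (5 findings, 600 characters) are hypothetical values chosen for illustration.

```typescript
// Hypothetical report shape: a verdict plus findings with paths.
type SubAgentFinding = { path: string; note: string };
type SubAgentReport = { verdict: string; findings: SubAgentFinding[] };

// Renders a report for the main chat, capped by finding count and by characters.
// Both limits are illustrative assumptions, not measured values.
function renderSubAgentReport(report: SubAgentReport, maxFindings = 5, maxChars = 600): string {
  const lines = ["verdict: " + report.verdict];
  for (const finding of report.findings.slice(0, maxFindings)) {
    lines.push("- " + finding.path + ": " + finding.note);
  }
  const dropped = report.findings.length - Math.min(report.findings.length, maxFindings);
  if (dropped > 0) {
    lines.push("(+" + dropped + " more findings left in the artifact file)");
  }
  const text = lines.join("\n");
  // Hard cap: cut the text and mark the cut with an ellipsis.
  return text.length <= maxChars ? text : text.slice(0, maxChars - 1) + "…";
}
```

The full report can still live in a file; only the capped version enters the main context.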

With websearch and codesearch, the story is similar. I added them to the Pi agent, discussed how OpenSpec might use this, thought about whether to allow free search. In the end, free mode turned out to be dangerous. The model loves to search, because searching looks like work. A limit works better: go to the internet a couple of times, when you really need fresh versions, documentation, or real examples. For everything else, the local project context, short task packet, and checks are enough.
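
A search limit like this can be sketched as a small counter that a tool wrapper consults before each web request. The class, its method names, and the reason labels are hypothetical; they only mirror the three cases named above.

```typescript
// The three reasons mirror the cases named in the text; the labels are hypothetical.
type SearchReason = "fresh-versions" | "documentation" | "real-examples";

class SearchBudget {
  private used = 0;
  // Reasons of the searches that were granted, kept for a later review.
  readonly log: SearchReason[] = [];

  constructor(private readonly limit: number) {}

  // Grants a search while the budget lasts, then refuses.
  request(reason: SearchReason): boolean {
    if (this.used >= this.limit) return false;
    this.used += 1;
    this.log.push(reason);
    return true;
  }

  remaining(): number {
    return this.limit - this.used;
  }
}

// "A couple of times" per task, as an example limit.
const taskSearchBudget = new SearchBudget(2);
```

When `request` returns false, the wrapper can answer the model with a short refusal, which costs far fewer tokens than another page of search results.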

First live applications and unpleasant details

In parallel with the workflows, I tried real applications. Sometimes the results were already pleasing. For example, there were auto-generated prototypes with dark liquid-glass UI, Active / Pinned / Trash tabs, modal editor, confirm dialog, SQLite sessions, backup/restore scripts, and LAN-friendly dev.

At some point, a local 35B was able to generate a wide working prototype, then fix a long TypeScript problem itself, and call an explore sub-agent. This was an important shift. Before this, agent schemes often looked like a fight with tools. Here, for the first time, there was a feeling that a local model could be an executor for a real personal project.

But the small disasters didn’t go away.

In one editor, the model tried to add attributes to disable autocorrect on a mobile keyboard. The idea was normal: a specific editor went crazy from auto-replacement and started typing word fragments on the right. But the local model mixed up the brackets, and the application crashed with a syntax error in compiled JS. The fix was simple, but the episode itself shows the level of risk: the model can understand the intent, make almost the right edit, and still break the app with one stupid bracket.

Another example was with Sovka. After the redesign to dark liquid glass, the interface became noticeably better, but the model created a second element with id="app" inside AppShell. The root app was already in index.html, and a second such id broke the app. Then I had to rename the inner shell to app-shell and add a guardrail: don’t create a second #app, the inner container should be called #app-shell or a class.

Incident	What the model tried to do	How it ended	What guardrail appeared
Mobile autocorrect in editor	Add attributes to editor root	Broke brackets and JS build	Small UI edits also need verification through build
Second id="app" in Sovka	Add inner shell	Broke the app due to root id conflict	#app only in index.html, shell is called #app-shell
Websearch for small setup	Find fresh versions	Context started bloating	Search only when needed and with a limit
Large docs in workflow	Help the model remember the project	Model confused workflow and product	Separate internal folders explicitly
External review without wiring context	Find error in diff	Reviewer could be wrong due to incomplete packet	Review packet must include wiring context

These things are hard to catch with a beautiful system prompt. They’re better turned into short rules that grew from real disasters.
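
Some of these rules can also be turned into cheap checks. As an illustration, here is a hypothetical guard for the duplicate root id: it takes file contents that were already read and reports every file that declares id="app" where it should not. The function name and the regular expression are assumptions, and the pattern only covers plain static attributes.

```typescript
type SourceFile = { path: string; content: string };

// Reports files that declare a static id="app" more often than allowed:
// once in index.html, never anywhere else.
function findExtraAppRoots(files: SourceFile[]): string[] {
  const offenders: string[] = [];
  for (const file of files) {
    // Matches id="app" or id='app' as a standalone attribute,
    // but not id="app-shell", data-id="app", or a bound :id="app".
    const pattern = /(?:^|\s)id\s*=\s*["']app["']/g;
    const hits = (file.content.match(pattern) || []).length;
    const isIndexHtml = file.path === "index.html" || /\/index\.html$/.test(file.path);
    const allowed = isIndexHtml ? 1 : 0;
    if (hits > allowed) offenders.push(file.path);
  }
  return offenders;
}
```

Run as part of the checks after each step, a guard like this turns a one-time disaster into a failure that shows up before the build does.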

10-ideal: scaffold as a laboratory

An important stage was 10-ideal. I didn’t see it as the final notes app. It was a laboratory for Qwen, Pi, and OpenCode: pnpm workspace, Vue 3 + Vite + TypeScript on the frontend, Express 5 + TypeScript on the backend, strict TypeScript, ESLint, Prettier, Tailwind v4, Vue Router, API client layer, later ideas of Pinia, Zod, OpenAPI, SQLite, auth, tests, and CI.

There gradually appeared an important insight: a local model needs a good starting environment. Not a huge architectural encyclopedia, but a clean scaffold with clear folders, check commands, and a few short rules.

I started leaning more toward a type-first structure (folders grouped by kind of code), not a feature-first one. For the frontend, something like:

src/api
src/stores
src/router
src/layouts
src/views
src/components/ui
src/components/<domain>
src/styles

For the backend:

src/routes
src/services
src/repositories
src/validation
src/middleware
src/config
src/db
src/types

The model works better when it doesn’t have to reinvent the project shape every time. In an empty folder, it fantasizes. In a good scaffold, it more often continues an existing thought.
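
The same folder list can double as a check on where new files land. This is a hypothetical guardrail, not something from the original scaffold: the function name is invented, and the list collapses src/components/ui and src/components/<domain> into a single src/components prefix.

```typescript
// Frontend folders from the list above; the check itself is a hypothetical guardrail.
const frontendDirs = [
  "src/api",
  "src/stores",
  "src/router",
  "src/layouts",
  "src/views",
  "src/components",
  "src/styles",
];

// Returns the new files that do not sit inside any of the known folders.
function findStrayFiles(newFiles: string[], allowedDirs: string[] = frontendDirs): string[] {
  return newFiles.filter(
    (file) => !allowedDirs.some((dir) => file.indexOf(dir + "/") === 0)
  );
}
```

Entry files at the top of src would need an explicit exception in a real setup; the sketch only shows the idea of flagging folders the model invented on its own.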

Scaffold as rails: chaotic empty folder → model + rails → organized project

On the basis of this approach, living things started to appear. One example is a lightweight analog of pi-web for LAN/self-use: project folder management, creating new folders, list of Pi sessions through SDK, reader view, SSE events, session creation, guard against a second prompt in a running session, running/idle status through finally. There was still a lot of work to do, but the feeling was different. The model was no longer gluing together random Vue components. It was working inside a form that could be continued.
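
The guard against a second prompt and the running/idle status through finally fit in a few lines. This is a minimal sketch, not the code from that project: the class and the `run` callback are hypothetical stand-ins for whatever actually talks to the Pi SDK.

```typescript
type SessionStatus = "idle" | "running";

class PromptSession {
  private state: SessionStatus = "idle";

  get currentStatus(): SessionStatus {
    return this.state;
  }

  // `run` is a hypothetical callback that sends the prompt and resolves with the reply.
  async send(prompt: string, run: (prompt: string) => Promise<string>): Promise<string> {
    if (this.state === "running") {
      // Guard: refuse a second prompt while the first one is still in flight.
      throw new Error("session is already running a prompt");
    }
    this.state = "running";
    try {
      return await run(prompt);
    } finally {
      // Runs on success and on failure, so the status cannot get stuck on "running".
      this.state = "idle";
    }
  }
}
```

Putting the reset in `finally` is the design choice that matters: a failed run returns the session to idle just like a successful one.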

Checks, reviews, and an unexpected lesson about external models

I had an interesting episode with review-collect and external review. The local model passed a task on the notes scaffold, and an external reviewer erroneously marked a blocker on POST /api/notes, because it didn’t see express.json() in the wiring context. After that, it became clear that diff-audit is useful, but the review packet must include not only the diff, but also important pieces of wiring and untracked files.

Interestingly, the local Qwen in that case gave a more useful review than the external model. This doesn’t mean external models are useless. Rather, they depend very much on the packet they were given. If the packet is incomplete, the reviewer starts confidently complaining about things that don’t actually exist in the project.

Review format	What’s good	Where the risk is
Local review by Qwen on project	Sees more of the current context, can be practical	Can be too soft on its own solutions
External reviewer on diff	Fresh perspective, cheap critic, good for repair packet	Makes mistakes if the packet doesn’t contain wiring context
Full project dump	Lots of information	Expensive in context and often noisy
Deterministic review packet	Managed volume, easy to pass further	Need to know in advance which files matter

After that, I started valuing not the model-reviewer itself, but the quality of the packet. A good packet is sometimes more important than the difference between models.
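
A deterministic review packet can be sketched as a function that always emits the same three sections in the same order: the diff, the wiring context, and the untracked files. The types, the function name, and the section titles are hypothetical; the file contents are assumed to be read elsewhere.

```typescript
type PacketFile = { path: string; content: string };

// Renders one section with its files sorted by path, so the same input set
// always produces the same text regardless of the order it was collected in.
function renderPacketSection(title: string, files: PacketFile[]): string {
  const sorted = files
    .slice()
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
  const parts = ["## " + title];
  for (const file of sorted) {
    parts.push("### " + file.path + "\n" + file.content);
  }
  return parts.join("\n");
}

// Builds the packet text: diff first, then wiring context, then untracked files.
function buildReviewPacket(diff: string, wiring: PacketFile[], untracked: PacketFile[]): string {
  return [
    "## Diff\n" + diff,
    renderPacketSection("Wiring context", wiring),
    renderPacketSection("Untracked files", untracked),
  ].join("\n\n");
}
```

In the POST /api/notes case, the wiring section would have carried the file that registers express.json(), which is exactly what the external reviewer could not see.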

Transition to 35B

At some point it became clear that 9B was too tight for the main role. I tried different options: Qwen Coder 30B, Qwen 27B, Qwen2.5 Coder 14B, different quantizations. Some ran, but were too slow for a normal cycle. Qwen3 Coder 30B A3B Q3_K_L technically worked, but one audit-agent took about 700 seconds. Qwen3.5 27B on a full run reached tens of minutes. The model could be smarter than 9B, but a real workflow is killed not only by stupidity, but also by latency.

A real shift happened with Qwen3.6 35B A3B in GGUF, especially with UD-Q4_K_M. On my setup with Ryzen 7 5700X, 32 GB RAM, and a GPU with 16 GB VRAM, it became the first large local model that could be seriously integrated into development.

In a good configuration, I saw about 22–25 tokens per second on generation and about 280–318 tokens per second on prompt processing. Sometimes long prompt eval was lower, but still tolerable. At the same time, memory immediately became part of the workflow: video memory usage can reach about 14 out of 16 GB, RAM about 22 out of 32 GB, and every extra browser, VS Code, Docker, or second session is already felt.
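
These throughput numbers translate directly into waiting time. The sketch below does the arithmetic for a hypothetical worst case: a full reprocess of a 65,000-token context followed by a 1,000-token answer. The token counts are assumptions for illustration; only the tokens-per-second ranges come from the measurements above.

```typescript
type TurnEstimate = {
  promptSeconds: number;
  generationSeconds: number;
  totalSeconds: number;
};

// Time for one turn: prompt processing plus generation, each as tokens / (tokens per second).
function estimateTurn(
  promptTokens: number,
  outputTokens: number,
  promptTps: number,
  generationTps: number
): TurnEstimate {
  const promptSeconds = promptTokens / promptTps;
  const generationSeconds = outputTokens / generationTps;
  return { promptSeconds, generationSeconds, totalSeconds: promptSeconds + generationSeconds };
}

// Slow and fast ends of the measured range, with hypothetical token counts.
const slowTurn = estimateTurn(65000, 1000, 280, 22);
const fastTurn = estimateTurn(65000, 1000, 318, 25);
console.log(Math.round(slowTurn.totalSeconds), Math.round(fastTurn.totalSeconds));
// prints: 278 244
```

Most of that time is prompt processing: roughly 204 to 232 seconds just to read the context again. That is why anything that avoids a full reprocess matters more than a few extra tokens per second on generation.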

Model practicality matrix: speed vs capability

Model	How it felt in my experiments	Where it was useful	What held it back
Qwen 9B	Fast and convenient for small tasks	Simple fixes, short implementations, early agents	Narrow corridor, weaker at holding architecture
Qwen3 Coder 30B A3B Q3_K_L	Technically launched	Testing the idea of a large coder model	Audit-agent about 700 seconds, too slow
Qwen3.5 27B	Smarter than small models, but heavy	Experiments with long runs	Tens of minutes on a full run
Qwen2.5 Coder 14B	Tried as an intermediate option	Separate coding tasks	Speed and quality didn’t give the needed balance
Qwen3.6 35B A3B UD-Q4_K_M	The first option that feels like a working local executor	Agent coding, review, feature slices, scaffold-development	RAM, VRAM, long context, generation cost

With 35B, the ceiling changed. The model holds the task better, understands architectural notes better, works more calmly with a large project, and more confidently fixes its own errors. But old problems didn’t disappear. Long context is expensive. Full reprocessing feels painful. If the agent starts reading a lot, searching a lot, and writing a lot in chat, speed drops, and the session becomes heavy.

I thought I’d be choosing models. Instead, I started discussing batch, ubatch, KV cache, prompt cache, context at 65k or 81k, how much RAM to squeeze from the browser, how not to kill the system, what to do with Docker Desktop, how much to leave for VS Code. Somewhere at this stage, local LLM development stopped being only about prompts.

Context turned out to be a separate currency

Long context was a fundamental topic for me. Without it, agent development quickly falls apart: the model forgets documents, loses the current packet, doesn’t see the architecture, repeats old solutions. But you can’t use large context as a junk drawer.

The more I experimented, the more I wanted to push details into files and artifacts, and give the main flow a short packet. If the model writes a huge report in chat, it harms the next step. If the reviewer makes a short repair packet, it can be passed to the local executor without extra noise.

Context budget: tank with good and bad elements

What started working better:

• long details live in files
• the main chat gets a short packet
• review returns findings, paths, verdict
• implementer doesn't write a long report
• checks run after each step
• formatting is done on specific files
• the next step doesn't read the whole history again
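
Formatting and linting only specific files can be sketched as a function that turns the list of touched files into command arguments. The extension lists and the function names are assumptions; passing file paths to `prettier --write` and to `eslint` is standard usage of both CLIs.

```typescript
// Extension lists are assumptions for a Vue + TypeScript project.
const formattableExtensions = [".ts", ".vue", ".json", ".css"];
const lintableExtensions = [".ts", ".vue"];

function hasAnyExtension(file: string, extensions: string[]): boolean {
  return extensions.some((ext) => file.slice(-ext.length) === ext);
}

// Returns argv arrays (command plus arguments) for the touched files only.
// Arrays avoid shell quoting problems with unusual file names.
function checkCommands(touched: string[]): string[][] {
  const formattable = touched.filter((f) => hasAnyExtension(f, formattableExtensions));
  const lintable = touched.filter((f) => hasAnyExtension(f, lintableExtensions));
  const commands: string[][] = [];
  if (formattable.length > 0) commands.push(["prettier", "--write"].concat(formattable));
  if (lintable.length > 0) commands.push(["eslint"].concat(lintable));
  return commands;
}
```

Running the formatter on the whole repository after every step produces diffs the model never intended, which then leak into the next review packet. Limiting it to touched files keeps each diff about one thing.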

Context, memory, and speed gradually became part of one system. On paper, you can say: just get a smarter model. In practice, you have to ask differently: how much does it eat, how much does it generate, how much does it hold, how often can you run checks, how painful is it to restart a session, can you simultaneously hold a browser and VS Code.

Where remote models are still useful

After the transition to 35B, remote models didn’t disappear. Rather, their role became calmer. Free Laguna and other OpenRouter models are useful as an external planner, critic, or reviewer. Especially when you need to look at a solution from the outside.

But giving them the whole project every time is inconvenient. A one-time external review works better: collect a review packet, give it to the model, get specific notes, return a short repair packet to the local executor.

In the end, the external reviewer became not the main brain, but an extra pair of eyes. The main work still comes down to the shape of the task.

What changed in expectations

If you look back, expectations changed several times.

First, it seemed that the right prompt was needed. Then it seemed that the right agent was needed. Then the right workflow. Then it became clear that you need an environment: templates, checks, small packets, clear commands, short rules, a normal way to transfer state between sessions.

My current rules for local LLM development

What currently looks most workable:

• a small slice instead of a large task
• a good scaffold instead of an empty folder
• a short AGENTS.md instead of an encyclopedia
• checks after each step
• stabilize pass before the next stage
• websearch only when needed
• review packet instead of a huge dump
• formatting only the touched files
• details in artifacts, not in chat
• one heavy local executor at a time

Another conclusion: you need different templates. One scaffold for a Vue + Express app. Separate for a semi-static personal site. Separate for CLI. Possibly separate for a small backend-only service. Each should have working check commands, a clear structure, minimal docs, and a few rules born from real breakages.

Where local models are already useful

Task	9B	35B
Small fix	Good	Good
Simple feature	Tolerable	Good
Component in a ready scaffold	Fine	Good
Architectural plan	Weak	Fine
Reviewing your own code	Limited	Useful
Working with a large context project	Quickly becomes tight	Real, but expensive
Long autonomous development	Bad	Still risky
Stabilize pass	Sometimes useful	Very useful
Working without scaffold	Often chaos	Better, but still risky
Working in a good scaffold	Fine	Very useful

In the end

The most pleasant thing about all this experience: progress is really there. Early experiments often looked like a fight with beautiful chatter. Then came the fight with configs, memory, agents, permission prompts, bloated documents, and sessions with tens of thousands of tokens. Now a local 35B can already make a piece of a real application on a good scaffold in a few hours. With bugs, with review, with human oversight, but it’s already working material.

The main skill turned out not to be writing the perfect prompt. It’s much more important to learn to build an environment around the model where it’s hard to do stupid things and easy to do useful work.

A local LLM in this format is like a worker in a personal workshop. It needs tools, a workbench, clear templates, a short task, and result verification. If you throw a whole idea at it, it will easily make a mess. If you give it a good scaffold, a small slice, and a normal set of checks, it starts being useful.

For me, that became the main conclusion. A local model can already be part of development. But value appears not from one model, but from everything around it: templates, packets, reviews, checks, limited search, careful context, and an honest attitude toward its weak spots.