AI-Native Development, Six Months In
What survived contact with reality across 20+ projects
AI-assisted development is everywhere now. The question for me has moved from “can one person build software with AI?” to a more interesting question: what discipline does it actually take to keep shipping - across multiple projects, over months, without the codebase turning into a landfill?
I attempt to describe my evolving process below. This is not a secret method. Everyone building with AI right now is figuring out their own version of this. What follows is just specificity about what’s working for me at this point in the tools’ evolution - the practices that have survived contact with reality across enough projects that I trust them.
Skylark Creations is a one-person, AI-native product studio I run from Colorado. I build human-friendly apps - tools for wonder - at the pace of curiosity, and I partner with domain experts (when I can find someone who wants to have some fun building together). I currently keep 6–10 projects in active motion every day inside a portfolio of 20+. What follows is the operating system that keeps them all moving.
The biggest shift: from pair programming to running a tiny org
The early version of my workflow was simple and I described it in this post about Sagent (my first solo app) in September: design with ChatGPT, implement with Cursor, audit the diff. Two roles, one loop. It worked for a single project at a time.
What changed is that I now treat models like specialists with distinct jobs - not interchangeable assistants I happen to be chatting with.
Claude inside Cursor handles planning and UX coherence. It’s good at “make this clearer” and “make this less fragile.”
Codex inside Cursor is my practical coding partner while I’m in the trench - quick fixes, wiring, the stuff that needs to happen right now.
I don’t let a new model come out without testing it inside Cursor on a right-sized, low-risk task.
GPT-5.2-Pro and Opus 4.6 are my reasoning and planning layer. GPT proposes architectures. Opus (with extended thinking) tries to break the proposal - fundamental errors, smuggled assumptions, edge cases, things that will hurt at scale.
Codex 5.3 is the workhorse. When it’s time to crank - implementation, scaffolds, repetitive changes across files - this is the engine.
Humans (selectively) bring product truth, domain nuance, taste calibration, security gut-checks, and the job that matters most: telling me when I’m fooling myself.
A single model can be brilliant and still fail in exactly the wrong way at exactly the wrong time. Using several in distinct roles means disagreements surface early - before the codebase fossilizes around a bad assumption.
The loop
Here’s the cycle I try to stay inside:
Name and describe the problem precisely.
Design the architecture by designing the data.
Write a detailed planning doc.
Build continuity that survives agent swapping.
Right-size tasks to model capability and risk.
Ship thin slices.
Measure honestly with real feedback and real data.
Refactor aggressively - especially at feature boundaries.
Repeat, without stalling forward motion.
That’s not aspirational. It’s a working system. Everything below is commentary on how each piece works in practice.
Architecture first: the data is the product
After I sit with a problem for a while, I try to see the data model hiding inside it. What are the entities? What are the invariants? What’s derived versus stored? Where are the boundaries? What will I hate maintaining six months from now?
This is where I bring in GPT-5.2-Pro. I talk through candidate data structures and architectures - usually asking for multiple designs: “simple and shippable,” “scalable and clean,” “robust under uncertainty.” Then I hand the proposed shape to Opus 4.6 with extended thinking and ask it to attack the plan. Where are the silent failure modes? What assumptions are smuggled in? What’s underspecified? What breaks with real users, real mess, real scale? What’s the smallest structural change that prevents a whole class of bugs?
Only after that adversarial pass do I let implementation begin.
Code is cheap now. Commitments are expensive. Architecture is the set of commitments you decide to live with.
The planning doc: a spine for context switching
Every project starts with a planning doc written for two audiences: future me returning after a week of building something else, and AI agents who are powerful, fast, and prone to inventing requirements that don’t exist.
A typical planning doc includes: the problem statement (who hurts, and how), explicit non-goals (the temptations I’m refusing), user promises (what will be reliably true), an architecture snapshot (systems, data, boundaries), milestones as shippable slices, acceptance criteria that are observable, a risk list spanning technical, product, and operational concerns, an instrumentation plan (what I’ll measure and why), and refactor windows (what I intend to rewrite before and after major shifts).
This is how I preserve forward motion. The plan stays legible, so execution doesn’t feel like reinventing the project every time I open the repo.
Continuity: memory that survives the chat window
This is probably the single biggest lesson from scaling past one project. Agentic building collapses without continuity. If you don’t build project memory, you pay the “re-explain everything” tax on every session. Worse: you lose coherence across model swaps and parallel work.
So I treat continuity as an engineering artifact, not a nice-to-have. Each project maintains a decision log (“we chose X over Y, here’s why”), a gotchas list (mistakes that bit us once and shouldn’t bite again), a living architecture snapshot (what exists, how it talks, where it breaks), a “how to run this” ritual (commands, env vars, tests, deploy steps), and a quality baseline (tests, monitoring, analytics expectations).
I also maintain Cursor Environment Docs, an open-source .cursor/ documentation system that keeps environment context fresh, captures mistakes, and prompts updates when docs get stale - so the AI starts each session grounded in reality instead of guessing. (That documentation also becomes unwieldy if you let it, so refactor and reevaluate it, too, at some interval.)
A note on tooling: Opus 4.6 is the king now. If I knew I’d only be using Anthropic models going forward, I’d probably just work inside Claude Code with its skills, plugins, and CLAUDE.md conventions - it’s a clean system. But I work inside Cursor because it lets me easily swap between models on an active project, and that flexibility matters right now. I’d use Opus 4.6 for nearly everything if it weren’t so expensive. OpenAI still has catching up to do on the high end - but their workhorse models earn their keep, and being able to route different tasks to different providers on the fly is worth the tradeoff of maintaining my own continuity layer.
One thing I want to say bluntly: long chats are where clarity goes to die. I’ve seen someone call it context rot - as conversations stretch, reliability drops, and you start losing important constraints without noticing. My mitigation is file-based memory: keep sessions focused, write state to disk, reference docs instead of pasting everything into the chat. Continuity is more than convenience.
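To make that concrete, here’s a minimal sketch of what file-based memory can look like. The file names and layout (`docs/memory/`, `decisions.md`, `session-state.json`) are my invention for illustration, not a standard - any on-disk convention works, as long as the agent reads it at session start instead of relying on chat history:

```python
"""Minimal sketch of file-based project memory (illustrative names)."""
import json
from datetime import date
from pathlib import Path

MEMORY_DIR = Path("docs/memory")  # hypothetical location

def log_decision(choice: str, over: str, why: str) -> None:
    """Append a dated entry to the decision log on disk."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    entry = f"- {date.today()}: chose {choice} over {over} - {why}\n"
    with open(MEMORY_DIR / "decisions.md", "a") as f:
        f.write(entry)

def save_session_state(state: dict) -> None:
    """Write the current working state so the next session can resume."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    (MEMORY_DIR / "session-state.json").write_text(json.dumps(state, indent=2))

def load_session_state() -> dict:
    """Reload state at session start; empty dict on a fresh project."""
    path = MEMORY_DIR / "session-state.json"
    return json.loads(path.read_text()) if path.exists() else {}
```

The point isn’t the helper functions - it’s that state lives in files the next session (or the next model) can read, so nothing important depends on any one chat surviving.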
Automation: three tiers, one boundary
A major upgrade in how I work is being more deliberate about what to automate and what to keep human. I think about it in three tiers.
Memory is the foundation. A repo-level context file that explains architecture, commands, and conventions - kept useful with real examples and updated whenever the AI makes a wrong assumption. I call it the “freelancer test”: would this document help a contractor produce good work on day one? If not, it’s not done.
Workflows are repeatable mechanical sequences. Slash commands for tasks I do often. Hooks that validate output automatically - format checks, test runs, required-section checks - and feed errors back to the agent so it self-corrects before I ever review. MCP server connections that let models pull from analytics, error monitoring, and databases directly instead of relying on me to copy-paste screenshots. Plans and state stored on disk so work survives across sessions without bloating a single conversation. Sub-agents that split research and implementation into separate contexts for quality or parallel speed.
The judgment boundary is where I stop automating. Data collection? Automate it. Format validation? Automate it. Interpreting what the data means, deciding what to investigate next, choosing what to ship and what to tell stakeholders? That stays human. The line is simple in principle and requires constant vigilance in practice: automate the mechanical, keep humans for judgment.
A concrete example: how my daily report went from ritual to one command
The most satisfying expression of this approach right now is the automated daily usage report for agentic.ai.
This report covers product health across acquisition, engagement, trials, revenue, and subscriber activity. In previous years, this sort of work would have required an employee or hours of my time. When I first started generating the report with AI, it still required 15–20 minutes of manual orchestration every morning: tagging reference files, explaining context, running scripts, assembling data, fixing formatting mistakes.
Now it’s one command: /report
That slash command encodes the entire mechanical workflow. It computes the date and cutoff, finds yesterday’s report as a template, checks recent changes via git history, loads the relevant reference files, runs the metrics scripts, queries analytics via MCP, writes a new report file, updates subscriber records, and commits changes without pushing - because I review first.
The best part is the validation hook. After every write, a script checks the report for missing required sections, placeholder values, and basic math consistency. If validation fails, the error message goes back to the agent, and it self-corrects before I ever read the output.
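The math-consistency piece of a check like that might look something like this sketch - the `metric: value` report format is invented for the example, and the real report’s structure differs:

```python
"""Illustrative math-consistency check for a generated report."""
import re

def check_totals(report: str, parts: list[str], total: str) -> list[str]:
    """Verify that component metrics sum to the stated total."""
    values = {}
    for line in report.splitlines():
        m = re.match(r"(\w+):\s*(\d+)", line.strip())
        if m:
            values[m.group(1)] = int(m.group(2))
    missing = [p for p in parts + [total] if p not in values]
    if missing:
        return [f"missing metric(s): {', '.join(missing)}"]
    expected = sum(values[p] for p in parts)
    if values[total] != expected:
        return [f"{total} is {values[total]}, but {' + '.join(parts)} = {expected}"]
    return []
```

A check this dumb catches a surprising class of agent mistakes: numbers copied from the wrong day, a section regenerated without its total, a placeholder that never got filled in.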
That pattern - agent acts, hook validates, agent fixes - is one of the cleanest ways I’ve found to get speed without sacrificing trust. And it preserves the correct boundary. Claude does the mechanical work. I do the human work: read, interpret, decide what to investigate, decide what to ship, decide what to tell stakeholders.
A daily report like this becomes the studio’s heartbeat.
Right-sized work
Most AI-assisted failures I’ve seen (including my own) come from giving the model the wrong chunk size. Too big and you get hallucinated structure, missed edge cases, brittle systems. Too small and you drown in coordination overhead and lose the narrative thread. You’ve got to get to know the jagged frontier.
So I size tasks to the model and the risk. Reasoning partners handle architecture decisions, sequencing, tradeoffs, and red-teaming. Workhorse coding agents handle implementation that’s already specified, mechanical refactors, and test scaffolding. Humans handle product truth, domain nuance, and anywhere being wrong has real cost.
I try to shape tasks like good functions: one purpose, clear inputs and outputs, observable success criteria, and tests where possible.
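As an illustration (not a tool I actually use), that shaping principle can be written down literally - every field name here is my own invention:

```python
"""Sketch of a task brief shaped like a function signature."""
from dataclasses import dataclass, field

@dataclass
class TaskBrief:
    purpose: str                    # one purpose, stated once
    inputs: list[str]               # files, APIs, constraints handed in
    outputs: list[str]              # artifacts the task must produce
    success_criteria: list[str]     # observable, checkable statements
    tests: list[str] = field(default_factory=list)  # where possible

    def is_well_shaped(self) -> bool:
        """A task is ready to hand off only if every slot is filled."""
        return all([self.purpose, self.inputs, self.outputs,
                    self.success_criteria])

# Example of a right-sized task for a workhorse coding agent:
brief = TaskBrief(
    purpose="Extract date parsing into a shared helper",
    inputs=["src/report.py", "src/ingest.py"],
    outputs=["src/dates.py", "updated call sites"],
    success_criteria=["all existing tests pass", "no duplicate parsing code"],
    tests=["tests/test_dates.py"],
)
```

If I can’t fill in every slot, the task isn’t sized wrong - it isn’t understood yet, and it goes back to the reasoning layer.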
Refactor often - especially at feature boundaries
I refactor aggressively. Before major feature changes: clean seams, simplify interfaces, reduce coupling. After major changes: consolidate duplication, rename for clarity, strengthen tests, delete prototype leftovers.
Refactoring isn’t procrastination. It’s paying rent on architecture. It’s also how you keep shipping without the codebase quietly becoming a graveyard of “almost-right” assumptions that no one ever cleaned up.
Why I run a portfolio
I maintain 20+ active experiments, but only 6–10 are in true motion at any time. The rest are incubating, waiting for a constraint to snap into place, or parked as seeds.
The portfolio approach prevents stalls. When one project hits a wall - waiting on an API decision, stuck on a hard bug, tangled in a design knot, waiting for user feedback / data - I rotate. The planning docs and continuity system keep context switching cheap. The daily telemetry loop keeps me honest about what’s actually working.
Some of what’s currently in motion: Agentic.ai (news intelligence with deduplication, role-aware lenses, and citations you can verify), FrameDial News (consensus facts with contrasting value frames - “facts, frames, receipts”), and Decomposer (turns sprawling goals into structured plans with honest effort routing). A few others are further from the spotlight - a wildflower learning app, a Gmail attention fix, a Buddhism conversation tool (BuddhaUR - which might be my best work) grounded in primary sources. Urban Journeys is ready to help you explore your city. Frolic (still not quite ready for primetime) is the AI that will help you get off your couch and off your screen. They’re all small - deliberately so - focused, and they all run on the same spine described above.
Quality before wind
Before I invite real users, every project needs a baseline: good test coverage around the core loops, error monitoring that’s release-aware and actionable, analytics that measure outcomes (not attention extraction), and a repeatable reporting loop so I’m steering by reality.
What’s changed recently is that I can increasingly wire this up through APIs and MCP servers, and unify it into a single daily narrative instead of living inside ten dashboards. Instrumentation is not optional. It’s how a one-person studio stays honest.
What I’ve learned (the unglamorous version)
The human jobs in this work remain: taste (what’s worth building and how it should feel), attention (edge cases, cleanup, the stuff models gloss over), and agency (sticking with it and shepherding the humans and agents toward a clear goal). But running this system across many projects has sharpened the picture. I wish I were better at taste.
The least glamorous lesson is the most important: write things down and keep them current. Design the data structure before writing code. Automate the mechanical. Keep judgment human. Ship thin slices. Measure honestly. Refactor like your future self lives there.
None of this is secret. The tools are available to anyone. What makes it work isn’t access - it’s the willingness to treat discipline as a feature, not a tax.
And if you’re building your own version of this - with different tools, different taste, a different portfolio - I’d genuinely like to hear what’s working for you.