Designing AI-Driven Development Workflows

Why Out-of-the-Box Tools Require Customization

Jun 30, 2026

Evaluating the operational efficiency of engineering workflows is essential when integrating advanced language models into daily development cycles. To understand the practical boundaries of autonomous code generation, I ran an implementation experiment using the GPT-5.3-Codex model. The objective was to complete a medium-sized user story: adding SendGrid template rendering and storage to an established email notification use case.

This assessment contrasts two distinct methodologies: an out-of-the-box framework known as spec-kit, and a lightweight alternative designated as the custom workflow. The custom workflow utilizes chat mode during the initial kick-off and planning stages to generate specific tasks, subsequently shifting to Codex for the explicit implementation of those tasks guided by an AGENT.md operational file and localized skills. The goal is to determine whether spec-kit provides immediate utility without modification or if its inherent structural characteristics necessitate explicit customization.

Architectural Slicing and Pull Request Topography

The structural composition of code updates significantly influences the sustainability of continuous integration pipelines. During the experiment, the custom workflow isolated changes into modular components that aligned directly with my default hexagonal architecture. This approach generated six discrete pull requests. A typical update touched three or four files (the median was three), and the largest contained five. Every file generated by this workflow contained exclusively functional code, eliminating secondary artifact noise.

In contrast, the spec-kit framework approached the user story through vertical slicing, attempting to package complete functional business capabilities into each cycle. This strategy yielded four pull requests, but the internal volume of these updates was substantially larger, averaging seven to eight files per pull request. The most expansive update within this set encompassed thirteen distinct files.
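The size figures for both workflows can be recomputed from the per-commit file counts listed in the Chronological Integration Patterns section of this post. The snippet below is only a small sketch of that arithmetic; the variable and function names are mine, and it treats each commit as one pull request, as the post does.

```python
from statistics import mean, median

# Files touched per pull request, taken from the commit-by-commit
# breakdown in the "Chronological Integration Patterns" section.
custom_workflow = [5, 4, 3, 2, 3, 3]
spec_kit = [8, 13, 4, 5]


def summarize(file_counts):
    """Return basic size statistics for a list of per-PR file counts."""
    return {
        "pull_requests": len(file_counts),
        "total_files": sum(file_counts),
        "mean": round(mean(file_counts), 1),
        "median": median(file_counts),
        "largest": max(file_counts),
    }


for name, counts in [("custom workflow", custom_workflow), ("spec-kit", spec_kit)]:
    print(name, summarize(counts))
```

The custom workflow comes out at 20 files across six pull requests (mean 3.3, median 3, largest 5), and spec-kit at 30 files across four (mean 7.5, largest 13).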

From an engineering operations perspective, large pull requests carry a real maintenance cost. Reviewing a thirteen-file modification requires deep contextual immersion and can easily consume a multi-hour block of protected engineering time. Conversely, integrating five to seven compact pull requests throughout a standard working day introduces negligible cognitive friction, provided the changes remain small and structurally isolated. Notably, both workflows initially introduced an identical rendering bug involving SendGrid template helpers, which required a targeted corrective commit. This suggests that the structural layout of the pull requests, rather than initial code accuracy, serves as the primary differentiator in developer friction.

Discovery Mechanisms and Cognitive Loading

The preparation phase exposes a stark contrast in the type of mental energy required by each workflow. The custom workflow relies heavily on an interactive discovery process during the chat-based kick-off. The model proactively initiated a clarification and planning dialogue to map out the implementation requirements before generating the discrete tasks for Codex.

This interactive session translated into a substantial textual footprint. The initial clarification phase required three distinct iterations of questioning and answering, totaling eight pages and 2,281 words. This was immediately followed by the planning phase, which required two subsequent iterations and produced an additional eight pages and 2,370 words. Cumulatively, this chat dialogue generated 16 pages of standard layout text, or 4,651 words. Assuming an average conversational rate of 150 words per minute, this preparatory phase equates to a thirty-minute collaborative pair-programming session.

The spec-kit framework approaches preparation through localized, static document synthesis rather than ongoing verbal dialogue. Before initiating code generation, the tool compiled eight distinct analytical documents within the specs directory.

An examination of the generated specs directory reveals how this text is distributed across individual documents. The requirements checklist contains 149 words, while the OpenAPI contract specifying the interface changes takes up 102 words. The data model specification consists of 211 words, and the implementation plan spans 409 words. Additionally, the quickstart document contains 140 words, the research summary covers 260 words, the comprehensive technical specification comprises 1,021 words, and the final task breakdown document details 1,540 words.

The documentation total matches the 16-page volume of the conversational workflow but contains 3,832 words of highly dense technical material. When applying an analytical reading standard of 75 words per minute for complex documentation, reviewing this output demands roughly 50 minutes of solitary, rigorous technical analysis. This calculation excludes the initial setup interactions required to seed the tool.
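The time estimates in this section are simple division, and they can be checked with the word counts quoted above. This is a sketch of that arithmetic only; the two reading rates are the ones assumed in the post, and the dictionary keys are just labels.

```python
# Word counts reported in this section.
chat_words = {"clarification": 2281, "planning": 2370}
spec_words = {
    "requirements checklist": 149,
    "OpenAPI contract": 102,
    "data model": 211,
    "implementation plan": 409,
    "quickstart": 140,
    "research summary": 260,
    "technical specification": 1021,
    "task breakdown": 1540,
}

CONVERSATION_WPM = 150  # conversational rate assumed in the post
ANALYTICAL_WPM = 75     # careful-reading rate assumed in the post


def minutes(total_words, words_per_minute):
    """Convert a word count into minutes at a fixed pace."""
    return total_words / words_per_minute


chat_total = sum(chat_words.values())
spec_total = sum(spec_words.values())

print(chat_total, round(minutes(chat_total, CONVERSATION_WPM)))
print(spec_total, round(minutes(spec_total, ANALYTICAL_WPM)))
```

The chat dialogue works out to 4,651 words and about 31 minutes; the spec-kit documents to 3,832 words and about 51 minutes.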

Insight: In this experiment, a collaborative, bidirectional technical dialogue produced less cognitive fatigue than independently parsing dense, machine-generated analytical documentation. The conversational format allows an engineer to guide the discovery path dynamically, whereas the document-heavy approach demands prolonged, solitary document-review stamina.

Estimation Metrics and Delivery Impact

To contextualize project velocity, I utilize a standard estimation scale where one point equates to a minor task, three points represent half of a development iteration, and five points correspond to a full iteration block. Historically, the targeted user story would receive an empirical estimate of three story points.

By utilizing either AI-driven development environment, the effective complexity of the implementation dropped significantly, allowing the story to be re-estimated at two points. This finding aligns with observations gathered over a multi-month period: the strategic application of generative models consistently removes approximately one story point from medium-sized requirements.

However, this efficiency gain does not scale with story size. One point off a three-point story removes a third of the estimate; one point off a highly complex, five-point story would remove only a fifth, and it does not alter the fundamental delivery architecture or allow the task to be decomposed more effectively. For larger software initiatives, the return on investment provided by these autonomous tools requires further empirical evaluation.
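To make the proportions concrete, here is a small sketch of the relative saving from a fixed one-point reduction. It assumes the same one-point reduction carries over to a five-point story, which is a hypothetical: the observation above covers medium-sized stories only. The function name is mine.

```python
def relative_saving(original_points, points_removed=1):
    """Share of the original estimate removed by a fixed point reduction."""
    return points_removed / original_points


for points in (3, 5):
    print(f"{points}-point story: {relative_saving(points):.0%} of the estimate saved")
```

On this scale the same absolute saving shrinks from roughly a third of the estimate to a fifth as the story grows.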

Long-Term Repository Maintenance and Documentation Bloat

A critical consideration when adopting spec-kit out of the box is the long-term structural health of the code repository. Generating eight non-service documentation files for a single medium-sized user story introduces a noticeable maintenance tail.

Consider a baseline engineering department consisting of three to four development pairs. If these pairs collectively deliver approximately three completed user stories per development iteration across 26 annual iterations, the composition of the repository changes dramatically over time. Under the unmodified spec-kit framework, and assuming each story produces eight files as this one did, this delivery velocity results in the accumulation of roughly 600 non-service Markdown and YAML files every year. Managing the lifecycle, accuracy, and relevance of hundreds of static documentation files creates an administrative burden that can quickly erode the initial velocity gains of automated generation.
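The yearly figure follows directly from the numbers above. The sketch below assumes every story generates the same eight documents that this one did, which is an extrapolation from a single experiment; the constant and function names are mine.

```python
STORIES_PER_ITERATION = 3  # delivered collectively by the department
ITERATIONS_PER_YEAR = 26
DOCS_PER_STORY = 8         # files spec-kit generated for this one story


def yearly_doc_files(stories_per_iteration, iterations_per_year, docs_per_story):
    """Number of generated documentation files accumulated in one year."""
    return stories_per_iteration * iterations_per_year * docs_per_story


total = yearly_doc_files(STORIES_PER_ITERATION, ITERATIONS_PER_YEAR, DOCS_PER_STORY)
print(total)
```

That is 78 stories and 624 files per year, which is where the rounded figure of roughly 600 comes from.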

Chronological Integration Patterns

The Custom Workflow Evolution

The custom workflow distributed code modifications across six isolated, single-purpose commits containing exclusively functional code. The sequence began with a five-file commit implementing active SendGrid template retrieval, which introduced the core repository interface, its SendGrid implementation, an exception for missing templates, a version value object, and a corresponding repository error test. Next, a four-file commit introduced the Handlebars email template renderer by modifying the build configuration and adding the renderer service, the rendered email domain model, and the renderer test suite.

The third step was a three-file commit handling the storage of the rendered template within the document management system, which impacted the primary use case, the consumer contract test, and the use case test. To address a rendering bug, a two-file corrective commit added explicit support for SendGrid Handlebars helpers within the core rendering logic. This was followed by a three-file commit introducing global exception mapping using an exception handler advice and its corresponding integration test. The evolution concluded with a three-file commit aggregating final integration and regression verifications across the controller and use-case boundaries.
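For readers less familiar with hexagonal slicing, the first commit in this sequence (repository interface, implementation, missing-template exception, version value object) has roughly the shape sketched below. This is an illustration only, written in Python for brevity: every class, method, and field name is hypothetical, it does not reproduce the project's code, and the in-memory adapter stands in for the SendGrid one without calling any SendGrid API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TemplateVersion:
    """Value object for one template version (hypothetical fields)."""
    template_id: str
    version_id: str
    body: str


class TemplateNotFoundError(Exception):
    """Raised when no active version exists for the requested template."""


class TemplateRepository(Protocol):
    """Port: the domain depends on this interface, not on SendGrid."""

    def find_active(self, template_id: str) -> TemplateVersion: ...


class InMemoryTemplateRepository:
    """Stand-in adapter for illustration; a real one would call SendGrid."""

    def __init__(self, templates: dict[str, TemplateVersion]):
        self._templates = templates

    def find_active(self, template_id: str) -> TemplateVersion:
        try:
            return self._templates[template_id]
        except KeyError:
            # Translate the storage-level miss into a domain exception.
            raise TemplateNotFoundError(template_id) from None


repo: TemplateRepository = InMemoryTemplateRepository(
    {"welcome": TemplateVersion("welcome", "v1", "Hello {{name}}")}
)
print(repo.find_active("welcome").version_id)
```

Because the port, the adapter, the exception, and the value object are separate small files, a commit that introduces them stays reviewable on its own, which is the property the custom workflow preserved.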

The Spec-Kit Framework Evolution

The spec-kit framework grouped its operations into broader, multi-file updates that combined documentation and implementation boundaries. The process opened with an eight-file initial commit compiling the prerequisite requirements, OpenAPI specifications, data models, plans, quickstart guides, research notes, technical specifications, and task manifests within the specs directory.

This was followed by a thirteen-file monolithic commit deploying the dynamic template rendering architecture, which simultaneously modified the build configuration, the task checklist, the exception handler advice, the SendGrid repository implementation, the missing template exception, the document management system store request, the rendered email domain model, the core use case, and their associated tests. The third phase was a four-file verification commit introducing test coverage for document management system failure scenarios. The cycle concluded with a five-file verification commit ensuring proper handling of missing templates, which updated integration tests, contract verifications, and serialization payloads.

Final Assessment: To Customize or Adopt As-Is

Returning to the original operational query: can spec-kit be utilized effectively without modification? The data suggests that an out-of-the-box deployment introduces distinct operational trade-offs that make customization necessary for long-term health.

While spec-kit succeeds in lowering short-term delivery complexity, its vertical slicing strategy creates overly large pull requests that challenge standard daily review workflows. Furthermore, the generation of extensive static documentation introduces systemic repository bloat that scales poorly across multiple engineering teams.

The ideal path forward requires a hybrid architecture. By customizing spec-kit to inherit the structural instructions of the custom workflow, we can merge the systematic rigor of automated planning with the clean, highly isolated pull request structure required by hexagonal architectures. Future efforts will focus on implementing custom skills within the agent configuration to restrict the generation of non-service files while preserving shared context between the conversational interface and the underlying code generation engine.
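One way such a restriction could be checked mechanically is a small guard over the list of changed paths in a pull request. The sketch below is hypothetical and not part of spec-kit or Codex: the `specs` directory name comes from this post, the five-file limit mirrors the largest custom-workflow pull request, and the function name and example paths are invented for illustration.

```python
from pathlib import PurePosixPath

SPECS_DIR = "specs"   # directory where spec-kit wrote its documents
MAX_FILES_PER_PR = 5  # assumption: largest custom-workflow pull request


def review_change_set(paths, max_files=MAX_FILES_PER_PR):
    """Flag oversized change sets and files that live under the specs directory."""
    non_service = [
        p for p in paths
        if PurePosixPath(p).parts and PurePosixPath(p).parts[0] == SPECS_DIR
    ]
    return {
        "file_count": len(paths),
        "too_large": len(paths) > max_files,
        "non_service_files": non_service,
    }


# Hypothetical change set, for illustration only.
example = [
    "specs/001-template-rendering/plan.md",
    "src/main/TemplateRenderer.java",
    "src/test/TemplateRendererTest.java",
]
print(review_change_set(example))
```

A check like this only reports; whether to block, warn, or move the flagged files elsewhere is a separate workflow decision.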

Nik Malykhin
