What If the ‘Cleanest’ Code Is the Wrong Solution?
In our continuing experiment with Trio Programming—two engineers and an AI—we decided to level up. Our first session was a slow, painful grind of fixing our environment. This time, with a stable foundation, we aimed for speed. Our new strategy: write comprehensive tests ourselves, then give the AI the freedom to implement the solution in one big step.
The initial results were promising. The AI produced working code that passed our tests. But then, our instincts as seasoned developers kicked in. We saw the AI’s implementation—a simple Map<String, Object>—and reflexively identified it as a “code smell”. We spent the next hour trying to refactor it into a “cleaner,” more object-oriented design using the Composite pattern.
That’s when we fell into a trap. Our pursuit of clean code was leading us toward a solution that was elegant, sophisticated, and completely wrong. This led us to our second major discovery: In AI-augmented development, the biggest risk isn’t bad AI code, but good human intuition applied to the wrong problem.
Our Setup: Aiming for a Bigger Step
Our team remained the same: I (Nik) acted as the driver for GitHub Copilot, while Javier served as the strategic navigator. Having stabilized our Java, Spring Boot, and Gradle environment in the last session, we were ready to test a new hypothesis: if we write strong, expectation-focused tests, we can trust the AI with a larger implementation scope and move much faster.
The flow was simple:
1. Write a small, focused test with clear assertions (the humans’ job).
2. Let the AI generate the implementation code in a single, larger step to make the test pass.
3. Trust the tests to validate the AI’s work, rather than meticulously reviewing every line of generated code.
The Failed Experiment: Refactoring into a Corner
The first part of the experiment worked. We added two tests for our hierarchy API, one for a root-only employee and one for a simple employee-supervisor relationship. We then prompted the AI: “tests looks good, let’s make postHierarchy method for passing all of them”.
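To make “expectation-focused” concrete, here is a sketch of what the second of those tests might have looked like. The endpoint path /hierarchy and the exact request shape are our assumptions for illustration, not the session’s actual code; the names match the kata example quoted later.

```java
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.content;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;

@SpringBootTest
@AutoConfigureMockMvc
class HierarchyControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void nestsEmployeeUnderSupervisor() throws Exception {
        // "Sophie reports to Jonas" should come back as Jonas at the root
        // with Sophie nested beneath. The assertion is pure JSON shape:
        // it says nothing about how the implementation builds it.
        mockMvc.perform(post("/hierarchy")
                .contentType(MediaType.APPLICATION_JSON)
                .content("{\"Sophie\": \"Jonas\"}"))
            .andExpect(status().isOk())
            .andExpect(content().json("{\"Jonas\": {\"Sophie\": {}}}"));
    }
}
```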
The AI’s implementation worked, save for one minor edge case we quickly fixed. But we weren’t satisfied. The code returned a Map<String, Object>, and our developer brains screamed for type safety and better design.
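For context, the AI’s working solution had roughly this shape. This is our reconstruction for illustration (method and variable names are ours), not the generated code verbatim:

```java
import java.util.HashMap;
import java.util.Map;

class HierarchyBuilder {

    // Builds the nested hierarchy from a flat employee -> supervisor map.
    // Each employee name becomes a JSON key once Jackson serializes the
    // nested maps, which is exactly the dynamic-key output the kata asks for.
    Map<String, Object> buildHierarchy(Map<String, String> employeeToSupervisor) {
        Map<String, Map<String, Object>> nodes = new HashMap<>();
        for (Map.Entry<String, String> entry : employeeToSupervisor.entrySet()) {
            Map<String, Object> child =
                nodes.computeIfAbsent(entry.getKey(), k -> new HashMap<>());
            nodes.computeIfAbsent(entry.getValue(), k -> new HashMap<>())
                .put(entry.getKey(), child);
        }
        // Roots are supervisors who never appear as somebody's employee.
        Map<String, Object> hierarchy = new HashMap<>();
        for (String supervisor : employeeToSupervisor.values()) {
            if (!employeeToSupervisor.containsKey(supervisor)) {
                hierarchy.put(supervisor, nodes.get(supervisor));
            }
        }
        return hierarchy;
    }
}
```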
1. The “Code Smell” Diagnosis: We prompted the AI with our concern: “maybe, response object will make the readability of the code better and will reduce smell of code?”. This initiated a refactoring plan to introduce a dedicated HierarchyNode class.
2. Applying a Design Pattern: We pushed further, suggesting a more formal structure: “maybe we can apply composite pattern... to our response?”. The goal was to create a pure, object-oriented hierarchy and eliminate the Map entirely (the design we were chasing is sketched after this list).
3. The Collision with Reality: Our final prompt revealed the fatal flaw in our logic: “can we avoid to use Map if we will use Spring Boot which we have in our project?”.
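Here is roughly the shape of the Composite-style design we were chasing. Class and field names are our reconstruction for illustration; the flaw is spelled out in the comment:

```java
import java.util.ArrayList;
import java.util.List;

// The "clean" design we pushed for. The catch: Jackson serializes this with
// fixed property names, e.g. {"name": "Jonas", "subordinates": [{"name":
// "Sophie", ...}]}, not the dynamic {"Jonas": {"Sophie": ...}} keys the
// kata's API contract requires.
public class HierarchyNode {

    private final String name;
    private final List<HierarchyNode> subordinates = new ArrayList<>();

    public HierarchyNode(String name) {
        this.name = name;
    }

    public void addSubordinate(HierarchyNode subordinate) {
        subordinates.add(subordinate);
    }

    public String getName() {
        return name;
    }

    public List<HierarchyNode> getSubordinates() {
        return subordinates;
    }
}
```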
The AI’s response was the turning point. It patiently explained that given our requirement for dynamic JSON keys (e.g., “Jonas”: { “Sophie”: ... }), a Map or a structure that serializes like one was unavoidable with Spring Boot and its default Jackson serializer.
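The “structure that serializes like one” loophole is real, for what it’s worth: Jackson’s @JsonAnyGetter annotation lets a typed wrapper expose an internal map as dynamic top-level JSON keys. A minimal sketch of that idea (not something we adopted in the session):

```java
import com.fasterxml.jackson.annotation.JsonAnyGetter;
import java.util.LinkedHashMap;
import java.util.Map;

public class HierarchyResponse {

    private final Map<String, Object> roots = new LinkedHashMap<>();

    public void addRoot(String name, Object subtree) {
        roots.put(name, subtree);
    }

    // Jackson flattens the returned map into top-level properties, so this
    // class serializes as {"Jonas": {"Sophie": ...}} rather than
    // {"roots": {...}}. A Map still does the actual work underneath.
    @JsonAnyGetter
    public Map<String, Object> getRoots() {
        return roots;
    }
}
```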
We had spent a significant part of our session chasing an elegant design that was fundamentally incompatible with the constraints of our framework and the explicit requirements of the kata. As I noted in my log, “we spend time trying to add something not workable to the code”. The AI’s initial, simpler solution wasn’t a code smell; it was the correct, pragmatic solution from the start.
Principles That Actually Work
This humbling experience confirmed our new hypothesis and revealed principles for a more effective human-AI workflow.
Focus on “What,” Not “How” (Test-Focused Development). Our initial strategy was correct. The most valuable role for the human developers is to define the behavior of the system through precise, comprehensive tests. When we focused on the expected JSON output, the AI produced correct code. When we focused on our preconceived notions of “good” internal implementation, we wasted time. The tests are the contract; the AI’s job is to fulfill it.
The AI is a Mirror for System Constraints. The AI is more than a code generator; it’s an interactive expert on the toolchain. It didn’t just reject our idea; it explained why it wouldn’t work within the Spring Boot ecosystem. This prevented us from going further down a dead-end path. Use the AI not just to write code, but to validate your architectural assumptions against the framework’s reality.
Codify Your Learnings into the System. A failed experiment is only a waste if you don’t learn from it. The most productive outcome of our refactoring dead-end was updating our .github/copilot-instructions.md file. We added an explicit refactoring protocol and guidance on when to challenge the AI’s use of patterns versus accepting framework constraints. This turns a session’s lesson into a permanent upgrade for the trio’s workflow.
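We won’t reproduce the file here, but the additions were along these lines (an illustrative excerpt, not our verbatim instructions):

```markdown
## Refactoring protocol
- Refactor only code that is covered by passing tests; rerun them after every step.
- Before introducing a design pattern, verify it is compatible with the
  framework's constraints (Spring Boot + Jackson serialization).
- When a "cleaner" design conflicts with an explicit requirement
  (e.g., dynamic JSON keys), prefer the pragmatic solution and record why.
```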
Unexpected Discovery: AI Generalizes from Specifics
After our refactoring detour, we returned to our Test-Focused workflow. We added much more complex tests, including one with multiple employees reporting to the same supervisor and another with a full four-level hierarchy.
The surprising part? The AI’s existing implementation passed these complex tests without any modifications. This revealed a powerful insight: the AI is remarkably good at generalizing a solution. It needed a few simple, specific test cases to establish the core logic. Once that logic was in place, it was robust enough to handle more complex scenarios automatically. The “big step” works, but it needs to be built on a foundation of small, clear examples.
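To illustrate, a single additional test can capture both of those cases at once. Again a sketch, in the same illustrative test class as before; the employee names beyond Jonas and Sophie are invented for the example:

```java
@Test
void buildsFourLevelHierarchyWithSharedSupervisor() throws Exception {
    // Pete and Barbara both report to Nick; the chain Jonas -> Sophie ->
    // Nick -> {Pete, Barbara} is four levels deep. The AI's original
    // implementation passed this kind of test without modification.
    mockMvc.perform(post("/hierarchy")
            .contentType(MediaType.APPLICATION_JSON)
            .content("{\"Pete\": \"Nick\", \"Barbara\": \"Nick\", "
                    + "\"Nick\": \"Sophie\", \"Sophie\": \"Jonas\"}"))
        .andExpect(status().isOk())
        .andExpect(content().json(
            "{\"Jonas\": {\"Sophie\": {\"Nick\": {\"Pete\": {}, \"Barbara\": {}}}}}"));
}
```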
The Central Paradox of AI-Driven Speed
This leads to the central paradox we uncovered in this session: To move faster with big, AI-generated implementation steps, you must first slow down and write smaller, more precise human-guided tests.
Our desire for speed was not at odds with the discipline of TDD; it was enabled by it. The quality of the AI’s large-scale contribution was directly proportional to the quality of the small-scale expectations we defined. You cannot achieve reliable speed by simply telling the AI “build this feature.” You achieve it by saying “build something that satisfies these very specific, verifiable behaviors.”
Conclusion: We Are Architects of Behavior, Not Just Code
Our second session was a success, but not because we wrote code faster. It was a success because we learned how to trust our tests more than our own implementation habits. The “Test-Focused Development” rhythm—small tests by humans, big implementation by AI—feels right.
The dynamic is shifting. Our job is becoming less about crafting the perfect implementation and more about architecting the perfect set of expectations. We define the contract with rigorous tests, and the AI, our tireless third programmer, finds the most direct way to fulfill it—even if it’s not the way we would have written it ourselves.