How to Introduce Tooling-Assisted Development
A practical guide for engineering leaders who want to add AI coding tools to their workflow without eroding standards. Includes a complete sample project with copy-paste governance rules, measurement dashboards, and pilot checklists.
Quick Take
Most teams fail with AI coding tools because they roll them out without rules. The result is unmaintainable code, security holes, and senior engineers who stop reviewing pull requests because "the AI wrote it." Success comes from treating AI as a new dependency: define what it can touch, measure the output, and review it like any other junior contributor. This guide gives you the governance framework, a complete sample project, and copy-paste templates to do it properly.
Signs Your Team Needs a Framework
- Developers are using Copilot or Cursor, but no one tracks what it generates or whether it improves velocity
- Pull requests contain AI-generated code that no one can explain or justify at review time
- Your board or CEO is asking "why aren't we vibe coding?" and you have no structured answer
- A developer pasted proprietary code into a cloud AI tool and you do not know what data left your network
The Sample Project: TaskFlow API
All examples in this guide use a single, realistic sample project so you can copy-paste governance rules directly into your own codebase. TaskFlow is a small REST API for task management - complex enough to have real concerns, small enough to read in one sitting.
Project Overview
- Language: Python with FastAPI
- Database: PostgreSQL with SQLAlchemy ORM
- Architecture: Layered (handlers → services → repositories)
- Tests: pytest with coverage reporting
- Auth and secrets: OAuth2 JWT authentication; database credentials via environment variables
Directory Structure
taskflow/
├── app/
│   ├── __init__.py
│   ├── main.py                 # FastAPI app factory
│   ├── config.py               # Pydantic settings (reads env vars)
│   ├── dependencies.py         # FastAPI dependency injection
│   ├── handlers/
│   │   ├── __init__.py
│   │   └── tasks.py            # HTTP route handlers
│   ├── services/
│   │   ├── __init__.py
│   │   └── task_service.py     # Business logic
│   ├── repositories/
│   │   ├── __init__.py
│   │   └── task_repository.py  # Database access
│   ├── models/
│   │   ├── __init__.py
│   │   └── task.py             # SQLAlchemy ORM models
│   └── schemas/
│       ├── __init__.py
│       └── task.py             # Pydantic request/response schemas
├── tests/
│   ├── conftest.py
│   ├── test_handlers.py
│   └── test_task_service.py
├── alembic/                    # Database migrations
├── requirements.txt
├── Dockerfile
└── docker-compose.yml
What Makes This Representative
TaskFlow has every layer that causes debate in AI governance: route handlers (boilerplate), business logic (human-only), database models (partially assisted), and authentication (strictly human). If you can govern this codebase, you can govern yours.
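To make the layering concrete, here is a minimal sketch of the handler → service → repository chain. The method names and bodies are illustrative assumptions consistent with the directory tree above, not code taken from the project:

from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session

from app.dependencies import get_db  # dependency injection, per the tree above
from app.models.task import Task     # SQLAlchemy ORM model


class TaskRepository:
    """app/repositories/task_repository.py - database access only."""

    def __init__(self, db: Session):
        self.db = db

    def get(self, task_id: int) -> Task | None:
        return self.db.get(Task, task_id)


class TaskService:
    """app/services/task_service.py - business logic (human-only under the policy below)."""

    def __init__(self, db: Session):
        self.repo = TaskRepository(db)

    def get_task(self, task_id: int) -> Task | None:
        return self.repo.get(task_id)


router = APIRouter()


@router.get("/tasks/{task_id}")
def read_task(task_id: int, db: Session = Depends(get_db)):
    """app/handlers/tasks.py - HTTP routing only."""
    return TaskService(db).get_task(task_id)

Each layer calls only the one below it, which is what makes per-directory AI rules enforceable: a file's path tells you its risk level.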
Step 1: Write the Governance Rules
Before any developer opens an AI tool, write down what it is allowed to generate. This is your AI usage policy, and it lives in version control next to your code.
Copy-Paste: AI Governance Policy
Create `docs/AI_GOVERNANCE.md` in your repository. This is the exact template we use with clients.
# TaskFlow AI Governance Policy
# Version: 1.0
# Last reviewed: <date>
# Owner: Engineering Lead
## 1. Tool Allowlist
Only the following AI coding tools are approved for use on the TaskFlow codebase:
| Tool | Use Case | Approved By | Date |
|------|----------|-------------|------|
| GitHub Copilot | Autocomplete, inline suggestions | CTO | 2026-05-16 |
| GitHub Copilot Chat | Explaining error messages, generating docstrings | CTO | 2026-05-16 |
Unapproved tools include, but are not limited to: ChatGPT web interface for code generation, Claude web interface for production code, Cursor (pending data governance review).
## 2. File Classification
Every file in the repository is classified by risk level. AI assistance rules differ by level.
### GREEN - AI-assisted generation allowed
Directories:
- `tests/` (unit tests for existing code)
- `app/schemas/` (Pydantic request/response schemas)
- `docs/` (README, API documentation)
- Migration stubs in `alembic/versions/` (boilerplate only)
Rules:
- Developer must review every generated line before commit.
- Generated code must pass existing linting and type checking.
- Pull request description must flag which files contain AI-generated code.
### AMBER - AI-assisted with mandatory human review
Directories:
- `app/handlers/` (HTTP route handlers)
- `app/repositories/` (database access layers)
Rules:
- AI may generate scaffolding, but business logic branches (if/else, loops, calculations) must be written by a human.
- Every AI-generated block must be marked with a comment: `# AI-generated - reviewed by <name> on <date>`.
- Requires approval from a senior engineer (not the author) before merge.
### RED - Human-only authoring
Files and directories:
- `app/config.py` (environment variable handling, secrets)
- `app/dependencies.py` (authentication, authorization)
- `app/services/task_service.py` (core business logic)
- `docker-compose.yml` (infrastructure, database credentials)
- Any file containing:
- Authentication or authorization logic
- Cryptographic operations
- Payment processing or financial calculations
- PII handling or data export logic
- Environment variable parsing or secret management
Rules:
- No AI-generated code in RED files.
- No exceptions without written sign-off from the CTO.
- RED files marked with a header comment: `# HUMAN-ONLY - see AI_GOVERNANCE.md`.
## 3. Data Sovereignty
Proprietary code and business logic must not leave the company's network.
- Cloud AI tools (GitHub Copilot) may only be used on GREEN files.
- AMBER and RED files must be edited with local-only tooling or with cloud AI features disabled.
- Developers must confirm in their pull request that no RED files were processed by cloud AI.
- Violation: first offence = warning. Second offence = tool access revoked.
## 4. Pull Request Requirements
Every PR containing AI-generated code must include:
1. **AI Usage Summary** - list every file with AI-generated code and the tool used.
2. **Verification Checklist** - the author ticks:
- [ ] I reviewed every line of AI-generated code for correctness.
- [ ] I ran the full test suite and all tests pass.
- [ ] I confirmed no RED files were sent to cloud AI.
- [ ] I added `# AI-generated` comments where required.
3. **Review Standard** - AI-generated code gets the same scrutiny as code from a junior developer on their first day.
## 5. Review Escalation
A senior engineer may block any PR that:
- Contains AI-generated business logic in AMBER files without clear justification.
- Modifies RED files without the `HUMAN-ONLY` header intact.
- Fails to answer "why this approach is correct" when asked in review.
No blame, no delay - just a request to rewrite the flagged section manually.
## 6. Incident Response
If AI-generated code causes a production incident:
1. Revert the change immediately (do not debug in production).
2. Root cause: was the tool, the prompt, or the review at fault?
3. Update this policy if the gap is structural.
4. Communicate learnings to the team within 24 hours.
How to Apply This to Your Project
Copy the file above into your repository. Replace `TaskFlow` with your project name. Map your directory structure to the GREEN / AMBER / RED classification. The only hard rule: if a bug in a file could cost money, expose data, or bring the system down, that file is RED.
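The classification can also be enforced mechanically. A minimal sketch of a CI or pre-commit check, assuming the TaskFlow paths above (the script name `check_red_headers.py` is our invention):

# check_red_headers.py - sketch: fail the build if a RED Python file
# is missing its mandatory '# HUMAN-ONLY' header (policy section 2).
import pathlib
import sys

RED_FILES = [
    "app/config.py",
    "app/dependencies.py",
    "app/services/task_service.py",
]
HEADER = "# HUMAN-ONLY - see AI_GOVERNANCE.md"

missing = [
    path for path in RED_FILES
    if not pathlib.Path(path).read_text().lstrip().startswith(HEADER)
]
for path in missing:
    print(f"{path}: missing HUMAN-ONLY header")
sys.exit(1 if missing else 0)

Wire this into CI so it runs on every pull request; a stripped header then blocks the merge instead of relying on reviewer memory.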
Step 2: Measure the Baseline
You cannot prove AI tools improve velocity if you do not know your starting point. Measure before you introduce the tool.
Copy-Paste: Baseline Metrics Template
Create `docs/AI_PILOT_BASELINE.md` and fill in the numbers for your team.
# AI Pilot Baseline - TaskFlow API
# Sprint: 24 (2026-05-01 to 2026-05-14)
# Measured by: Engineering Lead
## Baseline: Test Generation
| Metric | Value | Notes |
|--------|-------|-------|
| Average time to write tests for a new handler | 45 minutes | Measured over 3 handlers |
| Test coverage (pytest-cov) | 62% | Only happy paths covered |
| Tests written per sprint | 8 | Often deprioritised for features |
| Bugs found in production related to missing tests | 2 | Both edge case failures |
## Baseline: Documentation
| Metric | Value | Notes |
|--------|-------|-------|
| README last updated | 3 months ago | Out of sync with current endpoints |
| Inline docstrings in handlers | 15% | Most functions undocumented |
| API documentation generated | No | Swagger not configured |
## Baseline: Developer Sentiment
| Metric | Value | Notes |
|--------|-------|-------|
| Team size | 4 developers | 2 senior, 2 mid-level |
| Developers who have tried AI tooling | 1 | Junior dev used Copilot for personal project |
| Confidence in codebase quality (1-5) | 3 | Consistent feedback: "tests are thin" |
## Target Improvement (to be evaluated at end of pilot)
| Metric | Target | How we measure |
|--------|--------|----------------|
| Test generation time | -30% | Time-tracking on 3 new handlers |
| Test coverage | +15 percentage points | pytest-cov report |
| Tests written per sprint | +50% | Count in sprint retrospective |
| Production bugs from missing tests | 0 | Incident log |
| README accuracy | Current | Manual review by PM |
| Inline docstrings | +40 percentage points | Automated count via script |
| Developer confidence | +0.5 points | Anonymous survey (same 1-5 scale) |
Why these metrics matter: Time and coverage prove velocity. Sentiment proves adoption. If developers secretly hate the tool, the numbers will look good for a sprint and then collapse when the novelty wears off.
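The "Automated count via script" in the docstring row needs nothing beyond the standard library. A minimal sketch, assuming the `app/` layout of the sample project (the script name is hypothetical):

# docstring_coverage.py - sketch: count functions, methods and classes
# under app/ that carry a docstring, for the baseline and pilot tables.
import ast
import pathlib

total = documented = 0
for path in pathlib.Path("app").rglob("*.py"):
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            total += 1
            if ast.get_docstring(node):
                documented += 1

print(f"Docstrings: {documented}/{total} ({100 * documented / max(total, 1):.0f}%)")

Run it once for the baseline and again at the end of the pilot, so both numbers come from the same measurement.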
Step 3: Run a 2-Week Pilot
Two developers. One tool. One use case. This is the only way to separate signal from noise.
Copy-Paste: Pilot Charter
# AI Tooling Pilot Charter - TaskFlow API
# Duration: 2 weeks (Sprint 25: 2026-05-15 to 2026-05-28)
# Approver: CTO
## Scope
- **Tool:** GitHub Copilot (IDE autocomplete + Copilot Chat for explanations)
- **Use case:** Test generation and inline documentation for GREEN files only
- **Participants:** 2 volunteer developers (1 senior, 1 mid-level)
- **Exclusions:** No AMBER or RED files. No multi-file edits via Copilot Chat. No cloud uploads of RED files.
## Daily Stand-Up Addition
The two pilot participants answer one extra question:
- "What did the AI tool generate for you today, and did you keep or rewrite it?"
This takes 30 seconds and surfaces patterns faster than any metric dashboard.
## Week 1 Check-In (scheduled, 30 minutes)
1. Review the governance compliance check:
- Any RED files touched by AI? (expected: zero)
- Any AMBER files incorrectly classified? (fix immediately)
- Are `# AI-generated` comments present where expected?
2. Subjective feedback:
- What is working better than expected?
- What is slower or more frustrating?
- Any surprises in the generated code quality?
3. Mid-pivot option:
- If the tool is clearly wrong for the codebase, stop the pilot early. Document why.
## Sprint Retrospective (end of Week 2)
Measure against baseline. Fill in the same metrics table from `AI_PILOT_BASELINE.md`:
| Metric | Baseline | Pilot Sprint | Delta |
|--------|----------|--------------|-------|
| Test generation time | 45 min | ? | ? |
| Test coverage | 62% | ? | ? |
| Tests written per sprint | 8 | ? | ? |
| Inline docstrings | 15% | ? | ? |
| Developer confidence | 3.0/5 | ? | ? |
## Go / No-Go Decision
- **Expand:** All GREEN targets met, developers want to continue, no governance violations.
- **Pivot:** Results are mixed - try a different use case (e.g., documentation instead of tests).
- **Stop:** No measurable improvement or a governance violation occurred. Document findings and revisit in 3 months.
What a Healthy Pilot Looks Like
- ✔ The senior developer finds Copilot useful for boilerplate but rewrites 40% of the generated assertions
- ✔ The mid-level developer writes edge-case tests they would have skipped before
- ✔ No RED files were processed by AI - confirmed by IDE logs
- ✔ Both developers can explain every line they committed
- ✔ Coverage increased from 62% to 78%
What a Failed Pilot Looks Like
- ✗ A developer used Copilot to generate the entire `task_service.py` business logic layer
- ✗ Generated tests pass but do not actually assert anything meaningful
- ✗ A RED file (`config.py`) was sent to Copilot Chat by mistake
- ✗ Coverage stayed flat because developers stopped writing tests manually and relied only on AI stubs
- ✗ A senior engineer rejects a PR because the AI-generated code is harder to review than hand-written code
Step 4: Review AI-Generated Code Properly
The most important skill in tooling-assisted development is not prompting - it is critiquing. Here are real examples from the TaskFlow codebase.
Example 1: GOOD - AI-Generated Tests With Human Verification
Input prompt to Copilot (in a test file):
# Prompt: Write pytest tests for task_service.create_task
# that cover: valid input, duplicate title, missing required field.
AI-generated output (after developer review and edits):
import pytest

from app.models.task import Task
from app.services.task_service import TaskService


def test_create_task_success(mock_db):
    # AI-generated - reviewed by alice on 2026-05-16
    service = TaskService(mock_db)
    task = service.create_task(
        title="Write tests",
        description="Cover edge cases",
        owner_id=1,
    )
    assert task.title == "Write tests"
    assert task.status == "todo"
    mock_db.add.assert_called_once()
    mock_db.commit.assert_called_once()


def test_create_task_duplicate_title(mock_db):
    # AI-generated - reviewed by alice on 2026-05-16
    # EDIT: added explicit assertion for error message
    service = TaskService(mock_db)
    mock_db.query.return_value.filter.return_value.first.return_value = Task(id=1)
    with pytest.raises(ValueError) as exc_info:
        service.create_task(title="Duplicate", description="x", owner_id=1)
    assert "already exists" in str(exc_info.value)  # <- human added this line
Why this works: The AI generated the structure and three test cases. The human verified that the assertions actually test meaningful behaviour (not just "does it run?"). The `# AI-generated` comment flags it for reviewers. The `EDIT` note shows what the human changed - a pattern that builds trust in the codebase.
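The tests above depend on a `mock_db` fixture from `tests/conftest.py`, which this guide does not show. A plausible minimal sketch - the fixture's shape is our assumption, inferred from how the tests use it:

# tests/conftest.py - hypothetical sketch of the mock_db fixture.
from unittest.mock import MagicMock

import pytest


@pytest.fixture
def mock_db():
    # Stands in for a SQLAlchemy Session: add/commit calls are recorded,
    # and the query chain returns None by default (no duplicate title found).
    db = MagicMock()
    db.query.return_value.filter.return_value.first.return_value = None
    return db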
Example 2: BAD - AI-Generated Business Logic in an AMBER File
Input prompt in `app/handlers/tasks.py`:
# Prompt: Add a PATCH endpoint to update task status
AI-generated output (rejected in review):
@router.patch("/tasks/{task_id}/status")
def update_task_status(task_id: int, status: str, db: Session = Depends(get_db)):
    # AI-generated - reviewed by bob on 2026-05-17
    task = db.query(Task).filter(Task.id == task_id).first()
    if not task:
        raise HTTPException(status_code=404, detail="Not found")
    task.status = status
    db.commit()
    return task
Why this was rejected: the AI missed three issues.
1. No input validation on `status` - any string is accepted, breaking the status enum contract and allowing invalid state transitions (e.g., a fresh task jumping straight to archived).
2. No authorisation check - any authenticated user can update any task, including tasks owned by other users (a textbook IDOR vulnerability).
3. No audit logging or event emission for status changes, which breaks downstream reporting.
The reviewer rewrote the handler manually with a service-layer call, an enum-validated request schema, and an ownership check. The lesson: AI is fine for the route decorator and response shape (GREEN-level scaffolding), but the conditional logic and security surface must be human.
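A hedged sketch of what that human rewrite could look like - `StatusUpdate`, `TaskStatus`, `get_current_user` and `update_status` are our assumed names, not shown elsewhere in this guide:

from enum import Enum

from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy.orm import Session

from app.dependencies import get_db, get_current_user  # get_current_user is assumed
from app.services.task_service import TaskService

router = APIRouter()


class TaskStatus(str, Enum):
    todo = "todo"
    in_progress = "in_progress"
    done = "done"


class StatusUpdate(BaseModel):
    status: TaskStatus  # enum-validated: an invalid string is rejected with 422


@router.patch("/tasks/{task_id}/status")
def update_task_status(
    task_id: int,
    payload: StatusUpdate,
    db: Session = Depends(get_db),
    user=Depends(get_current_user),
):
    # Ownership check, transition validation and audit logging live in the
    # service layer (RED, human-only); the handler stays thin scaffolding.
    task = TaskService(db).update_status(task_id, payload.status, acting_user=user)
    if task is None:
        raise HTTPException(status_code=404, detail="Task not found")
    return task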
Example 3: RED - What Never Gets Generated
This file stays human-only. No AI tool goes near it.
# HUMAN-ONLY - see AI_GOVERNANCE.md
# app/config.py
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    database_url: str
    jwt_secret: str
    jwt_algorithm: str = "HS256"
    jwt_expiry_hours: int = 24

    class Config:
        env_file = ".env"


settings = Settings()
If an AI tool suggests changing the JWT algorithm or adding a default secret, that suggestion is ignored. The `# HUMAN-ONLY` header reminds every developer who opens the file that this is governed territory.
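One way to make that rule self-enforcing is a guardrail test. A minimal sketch, assuming pydantic v2 (the test file name is hypothetical; note that importing `app.config` also instantiates `Settings()`, so test env values must be available):

# tests/test_governance.py - sketch: fail CI if a secret field ever
# gains a default value, e.g. via an accepted AI suggestion.
from app.config import Settings


def test_secret_settings_have_no_defaults():
    for field in ("database_url", "jwt_secret"):
        assert Settings.model_fields[field].is_required(), (
            f"{field} must not have a default - see AI_GOVERNANCE.md"
        )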
Step 5: Mature the Governance
After the pilot, the rules evolve. Here is how the TaskFlow governance matures over three months.
Month 1: Pilot (2 developers, 1 tool, 1 use case)
- AI Governance Policy v1.0 live in `docs/AI_GOVERNANCE.md`
- GREEN classification covers tests and schemas only
- Daily stand-up includes one AI usage question
- All AI-generated PRs flagged manually by the author
Month 2: Expand (full team, same use case, add documentation)
- GREEN classification expanded to include handler scaffolding (no logic branches)
- Automated check: a linter scans AMBER files for `# AI-generated` comments that are missing a reviewer name (see the sketch after this list)
- Team training session: "How to critique AI output" - a one-hour workshop
- Monthly review: the Engineering Lead checks for RED file violations via IDE logs
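A minimal sketch of that Month 2 comment check, assuming the AMBER directories from the policy above (the script name is our invention):

# check_ai_comments.py - sketch: flag '# AI-generated' comments in AMBER
# files that lack the required reviewer name and date, then fail CI.
import pathlib
import re
import sys

AMBER_DIRS = ["app/handlers", "app/repositories"]
PATTERN = re.compile(r"#\s*AI-generated\s*-\s*reviewed by\s+\S+\s+on\s+\d{4}-\d{2}-\d{2}")

violations = []
for directory in AMBER_DIRS:
    for path in pathlib.Path(directory).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "AI-generated" in line and not PATTERN.search(line):
                violations.append(f"{path}:{lineno}: AI comment missing reviewer name/date")

for violation in violations:
    print(violation)
sys.exit(1 if violations else 0)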
Month 3: Integrate (AI as standard dependency)
- AI Governance Policy v2.0 - refined based on 60 days of data
- CI pipeline includes automated detection: if a RED file changes, verify no AI tool was active in the author's IDE session
- Onboarding doc updated: new developers read the AI governance policy before writing code
- Quarterly review: revisit metrics, adjust classifications, retire tools or approve new ones
The principle: Governance starts as a manual checklist and matures into an automated guardrail. But it always starts manual. You cannot automate what you have not understood.
What Egon Expert Delivers
This guide gives you the framework to do this yourself. If you want it done faster and with lower risk, we deliver the full programme in four stages.
1. Workflow Audit
We spend two days in your codebase mapping your directory structure, current testing practices, and code review standards. We classify every major module into GREEN / AMBER / RED. You get a written AI governance policy tailored to your project, not a generic template.
2. Governance Framework
We write the complete set of copy-paste documents: AI_GOVERNANCE.md, classification rules, PR checklist templates, and incident response procedures. These are checked into your repository and reviewed with your senior engineers before any tool is introduced.
3. Structured Pilot
We design and run a 2-week pilot with 2-3 volunteer developers. We set baselines, define success criteria, and facilitate the daily check-ins and end-of-sprint retrospective. You get a data-driven go / no-go recommendation, not a sales pitch.
4. Team Training
We run a hands-on workshop for your full engineering team: how to prompt effectively, how to critique AI output, and how to maintain the governance framework as your codebase evolves. The goal is self-sufficiency, not dependency on us.
Ready to Introduce AI Tools Properly?
We help engineering teams adopt AI coding tools with the governance and measurement that protects your codebase.