How to Introduce Tooling-Assisted Development
A practical guide for engineering leaders who want to add AI coding tools to their workflow without eroding standards. Includes a complete sample project with copy-paste governance rules, measurement dashboards, and pilot checklists.
Quick Take
Most teams fail with AI coding tools because they roll them out without rules. The result is unmaintainable code, security holes, and senior engineers who stop reviewing pull requests because "the AI wrote it." Success comes from treating AI as a new dependency: define what it can touch, measure the output, and review it like any other junior contributor. This guide gives you the governance framework, a complete sample project, and copy-paste templates to do it properly.
Signs Your Team Needs a Framework
- Developers are using Copilot or Cursor, but no one tracks what it generates or whether it improves velocity
- Pull requests contain AI-generated code that no one can explain or justify at review time
- Your board or CEO is asking "why aren't we vibe coding?" and you have no structured answer
- A developer pasted proprietary code into a cloud AI tool and you do not know what data left your network
The Sample Project: TaskFlow API
All examples in this guide use a single, realistic sample project so you can copy-paste governance rules directly into your own codebase. TaskFlow is a small REST API for task management - complex enough to have real concerns, small enough to read in one sitting.
Project Overview
- Language: Python with FastAPI
- Database: PostgreSQL with SQLAlchemy ORM
- Architecture: Layered (handlers → services → repositories)
- Tests: pytest with coverage reporting
- Auth and secrets: OAuth2 JWT authentication; database credentials via environment variables
Directory Structure
taskflow/
├── app/
│   ├── __init__.py
│   ├── main.py                 # FastAPI app factory
│   ├── config.py               # Pydantic settings (reads env vars)
│   ├── dependencies.py         # FastAPI dependency injection
│   ├── handlers/
│   │   ├── __init__.py
│   │   └── tasks.py            # HTTP route handlers
│   ├── services/
│   │   ├── __init__.py
│   │   └── task_service.py     # Business logic
│   ├── repositories/
│   │   ├── __init__.py
│   │   └── task_repository.py  # Database access
│   ├── models/
│   │   ├── __init__.py
│   │   └── task.py             # SQLAlchemy ORM models
│   └── schemas/
│       ├── __init__.py
│       └── task.py             # Pydantic request/response schemas
├── tests/
│   ├── conftest.py
│   ├── test_handlers.py
│   └── test_task_service.py
├── alembic/                    # Database migrations
├── requirements.txt
├── Dockerfile
└── docker-compose.yml
What Makes This Representative
TaskFlow has every layer that causes debate in AI governance: route handlers (boilerplate), business logic (human-only), database models (partially assisted), and authentication (strictly human). If you can govern this codebase, you can govern yours.
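To make the layering concrete, here is a minimal sketch of the handler → service → repository chain. The method names and bodies are illustrative assumptions consistent with the directory tree above, not code taken from the project:

from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session

from app.dependencies import get_db  # dependency injection, per the tree above
from app.models.task import Task     # SQLAlchemy ORM model


class TaskRepository:
    """app/repositories/task_repository.py - database access only."""

    def __init__(self, db: Session):
        self.db = db

    def get(self, task_id: int) -> Task | None:
        return self.db.get(Task, task_id)


class TaskService:
    """app/services/task_service.py - business logic (human-only under the policy below)."""

    def __init__(self, db: Session):
        self.repo = TaskRepository(db)

    def get_task(self, task_id: int) -> Task | None:
        return self.repo.get(task_id)


router = APIRouter()


@router.get("/tasks/{task_id}")
def read_task(task_id: int, db: Session = Depends(get_db)):
    """app/handlers/tasks.py - HTTP routing only."""
    return TaskService(db).get_task(task_id)

Each layer calls only the one below it, which is what makes per-directory AI rules enforceable: a file's path tells you its risk level.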
Step 1: Write the Governance Rules
Before any developer opens an AI tool, write down what it is allowed to generate. This is your AI usage policy, and it lives in version control next to your code.
Copy-Paste: AI Governance Policy
Create `docs/AI_GOVERNANCE.md` in your repository. This is the exact template we use with clients.
# TaskFlow AI Governance Policy
# Version: 1.0
# Last reviewed: <date>
# Owner: Engineering Lead
## 1. Tool Allowlist
Only the following AI coding tools are approved for use on the TaskFlow codebase:
| Tool | Use Case | Approved By | Date |
|------|----------|-------------|------|
| GitHub Copilot | Autocomplete, inline suggestions | CTO | 2026-05-16 |
| GitHub Copilot Chat | Explaining error messages, generating docstrings | CTO | 2026-05-16 |
Unapproved tools include, but are not limited to: ChatGPT web interface for code generation, Claude web interface for production code, Cursor (pending data governance review).
## 2. File Classification
Every file in the repository is classified by risk level. AI assistance rules differ by level.
### GREEN - AI-assisted generation allowed
Directories:
- `tests/` (unit tests for existing code)
- `app/schemas/` (Pydantic request/response schemas)
- `docs/` (README, API documentation)
- Migration stubs in `alembic/versions/` (boilerplate only)
Rules:
- Developer must review every generated line before commit.
- Generated code must pass existing linting and type checking.
- Pull request description must flag which files contain AI-generated code.
### AMBER - AI-assisted with mandatory human review
Directories:
- `app/handlers/` (HTTP route handlers)
- `app/repositories/` (database access layers)
Rules:
- AI may generate scaffolding, but business logic branches (if/else, loops, calculations) must be written by a human.
- Every AI-generated block must be marked with a comment: `# AI-generated - reviewed by <name> on <date>`.
- Requires approval from a senior engineer (not the author) before merge.
### RED - Human-only authoring
Files and directories:
- `app/config.py` (environment variable handling, secrets)
- `app/dependencies.py` (authentication, authorization)
- `app/services/task_service.py` (core business logic)
- `docker-compose.yml` (infrastructure, database credentials)
- Any file containing:
- Authentication or authorization logic
- Cryptographic operations
- Payment processing or financial calculations
- PII handling or data export logic
- Environment variable parsing or secret management
Rules:
- No AI-generated code in RED files.
- No exceptions without written sign-off from the CTO.
- RED files marked with a header comment: `# HUMAN-ONLY - see AI_GOVERNANCE.md`.
## 3. Data Sovereignty
Proprietary code and business logic must not leave the company's network.
- Cloud AI tools (GitHub Copilot) may only be used on GREEN files.
- AMBER and RED files must be edited with local-only tooling or with cloud AI features disabled.
- Developers must confirm in their pull request that no RED files were processed by cloud AI.
- Violation: first offence = warning. Second offence = tool access revoked.
## 4. Pull Request Requirements
Every PR containing AI-generated code must include:
1. **AI Usage Summary** - list every file with AI-generated code and the tool used.
2. **Verification Checklist** - the author ticks:
- [ ] I reviewed every line of AI-generated code for correctness.
- [ ] I ran the full test suite and all tests pass.
- [ ] I confirmed no RED files were sent to cloud AI.
- [ ] I added `# AI-generated` comments where required.
3. **Review Standard** - AI-generated code gets the same scrutiny as code from a junior developer on their first day.
## 5. Review Escalation
A senior engineer may block any PR that:
- Contains AI-generated business logic in AMBER files without clear justification.
- Modifies RED files without the `HUMAN-ONLY` header intact.
- Fails to answer "why this approach is correct" when asked in review.
No blame, no delay - just a request to rewrite the flagged section manually.
## 6. Incident Response
If AI-generated code causes a production incident:
1. Revert the change immediately (do not debug in production).
2. Root cause: was the tool, the prompt, or the review at fault?
3. Update this policy if the gap is structural.
4. Communicate learnings to the team within 24 hours.
How to Apply This to Your Project
Copy the file above into your repository. Replace `TaskFlow` with your project name. Map your directory structure to the GREEN / AMBER / RED classification. The only hard rule: if a bug in a file could cost money, expose data, or bring the system down, that file is RED.
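The classification can also be enforced mechanically. A minimal sketch of a CI or pre-commit check, assuming the TaskFlow paths above (the script name `check_red_headers.py` is our invention):

# check_red_headers.py - sketch: fail the build if a RED Python file
# is missing its mandatory '# HUMAN-ONLY' header (policy section 2).
import pathlib
import sys

RED_FILES = [
    "app/config.py",
    "app/dependencies.py",
    "app/services/task_service.py",
]
HEADER = "# HUMAN-ONLY - see AI_GOVERNANCE.md"

missing = [
    path for path in RED_FILES
    if not pathlib.Path(path).read_text().lstrip().startswith(HEADER)
]
for path in missing:
    print(f"{path}: missing HUMAN-ONLY header")
sys.exit(1 if missing else 0)

Wire this into CI so it runs on every pull request; a stripped header then blocks the merge instead of relying on reviewer memory.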
Step 2: Measure the Baseline
You cannot prove AI tools improve velocity if you do not know your starting point. Measure before you introduce the tool.
Copy-Paste: Baseline Metrics Template
Create `docs/AI_PILOT_BASELINE.md` and fill in the numbers for your team.
# AI Pilot Baseline - TaskFlow API
# Sprint: 24 (2026-05-01 to 2026-05-14)
# Measured by: Engineering Lead
## Baseline: Test Generation
| Metric | Value | Notes |
|--------|-------|-------|
| Average time to write tests for a new handler | 45 minutes | Measured over 3 handlers |
| Test coverage (pytest-cov) | 62% | Only happy paths covered |
| Tests written per sprint | 8 | Often deprioritised for features |
| Bugs found in production related to missing tests | 2 | Both edge case failures |
## Baseline: Documentation
| Metric | Value | Notes |
|--------|-------|-------|
| README last updated | 3 months ago | Out of sync with current endpoints |
| Inline docstrings in handlers | 15% | Most functions undocumented |
| API documentation generated | No | Swagger not configured |
## Baseline: Developer Sentiment
| Metric | Value | Notes |
|--------|-------|-------|
| Team size | 4 developers | 2 senior, 2 mid-level |
| Developers who have tried AI tooling | 1 | Junior dev used Copilot for personal project |
| Confidence in codebase quality (1-5) | 3 | Consistent feedback: "tests are thin" |
## Target Improvement (to be evaluated at end of pilot)
| Metric | Target | How we measure |
|--------|--------|----------------|
| Test generation time | -30% | Time-tracking on 3 new handlers |
| Test coverage | +15 percentage points | pytest-cov report |
| Tests written per sprint | +50% | Count in sprint retrospective |
| Production bugs from missing tests | 0 | Incident log |
| README accuracy | Current | Manual review by PM |
| Inline docstrings | +40 percentage points | Automated count via script |
| Developer confidence | +0.5 points | Anonymous survey (same 1-5 scale) |
Why these metrics matter: Time and coverage prove velocity. Sentiment proves adoption. If developers secretly hate the tool, the numbers will look good for a sprint and then collapse when the novelty wears off.
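The "Automated count via script" in the docstring row needs nothing beyond the standard library. A minimal sketch, assuming the `app/` layout of the sample project (the script name is hypothetical):

# docstring_coverage.py - sketch: count functions, methods and classes
# under app/ that carry a docstring, for the baseline and pilot tables.
import ast
import pathlib

total = documented = 0
for path in pathlib.Path("app").rglob("*.py"):
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            total += 1
            if ast.get_docstring(node):
                documented += 1

print(f"Docstrings: {documented}/{total} ({100 * documented / max(total, 1):.0f}%)")

Run it once for the baseline and again at the end of the pilot, so both numbers come from the same measurement.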
Step 3: Run a 2-Week Pilot
Two developers. One tool. One use case. This is the only way to separate signal from noise.
Copy-Paste: Pilot Charter
# AI Tooling Pilot Charter - TaskFlow API
# Duration: 2 weeks (Sprint 25: 2026-05-15 to 2026-05-28)
# Approver: CTO
## Scope
- **Tool:** GitHub Copilot (IDE autocomplete + Copilot Chat for explanations)
- **Use case:** Test generation and inline documentation for GREEN files only
- **Participants:** 2 volunteer developers (1 senior, 1 mid-level)
- **Exclusions:** No AMBER or RED files. No multi-file edits via Copilot Chat. No cloud uploads of RED files.
## Daily Stand-Up Addition
The two pilot participants answer one extra question:
- "What did the AI tool generate for you today, and did you keep or rewrite it?"
This takes 30 seconds and surfaces patterns faster than any metric dashboard.
## Week 1 Check-In (scheduled, 30 minutes)
1. Review the governance compliance check:
- Any RED files touched by AI? (expected: zero)
- Any AMBER files incorrectly classified? (fix immediately)
- Are `# AI-generated` comments present where expected?
2. Subjective feedback:
- What is working better than expected?
- What is slower or more frustrating?
- Any surprises in the generated code quality?
3. Mid-pivot option:
- If the tool is clearly wrong for the codebase, stop the pilot early. Document why.
## Sprint Retrospective (end of Week 2)
Measure against baseline. Fill in the same metrics table from `AI_PILOT_BASELINE.md`:
| Metric | Baseline | Pilot Sprint | Delta |
|--------|----------|--------------|-------|
| Test generation time | 45 min | ? | ? |
| Test coverage | 62% | ? | ? |
| Tests written per sprint | 8 | ? | ? |
| Inline docstrings | 15% | ? | ? |
| Developer confidence | 3.0/5 | ? | ? |
## Go / No-Go Decision
- **Expand:** All GREEN targets met, developers want to continue, no governance violations.
- **Pivot:** Results are mixed - try a different use case (e.g., documentation instead of tests).
- **Stop:** No measurable improvement or a governance violation occurred. Document findings and revisit in 3 months.
What a Healthy Pilot Looks Like
- ✔ The senior developer finds Copilot useful for boilerplate but rewrites 40% of the generated assertions
- ✔ The mid-level developer writes edge-case tests they would have skipped before
- ✔ No RED files were processed by AI - confirmed by IDE logs
- ✔ Both developers can explain every line they committed
- ✔ Coverage increased from 62% to 78%
What a Failed Pilot Looks Like
- ✗ A developer used Copilot to generate the entire `task_service.py` business logic layer
- ✗ Generated tests pass but do not actually assert anything meaningful
- ✗ A RED file (`config.py`) was sent to Copilot Chat by mistake
- ✗ Coverage stayed flat because developers stopped writing tests manually and relied only on AI stubs
- ✗ A senior engineer rejects a PR because the AI-generated code is harder to review than hand-written code
Step 4: Review AI-Generated Code Properly
The most important skill in tooling-assisted development is not prompting - it is critiquing. Here are real examples from the TaskFlow codebase.
Example 1: GOOD - AI-Generated Tests With Human Verification
Input prompt to Copilot (in a test file):
# Prompt: Write pytest tests for task_service.create_task
# that cover: valid input, duplicate title, missing required field.
AI-generated output (after developer review and edits):
import pytest

from app.models.task import Task
from app.services.task_service import TaskService


def test_create_task_success(mock_db):
    # AI-generated - reviewed by alice on 2026-05-16
    service = TaskService(mock_db)
    task = service.create_task(
        title="Write tests",
        description="Cover edge cases",
        owner_id=1,
    )
    assert task.title == "Write tests"
    assert task.status == "todo"
    mock_db.add.assert_called_once()
    mock_db.commit.assert_called_once()


def test_create_task_duplicate_title(mock_db):
    # AI-generated - reviewed by alice on 2026-05-16
    # EDIT: added explicit assertion for error message
    service = TaskService(mock_db)
    mock_db.query.return_value.filter.return_value.first.return_value = Task(id=1)
    with pytest.raises(ValueError) as exc_info:
        service.create_task(title="Duplicate", description="x", owner_id=1)
    assert "already exists" in str(exc_info.value)  # <- human added this line
Why this works: The AI generated the structure and three test cases. The human verified that the assertions actually test meaningful behaviour (not just "does it run?"). The `# AI-generated` comment flags it for reviewers. The `EDIT` note shows what the human changed - a pattern that builds trust in the codebase.
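The tests above depend on a `mock_db` fixture from `tests/conftest.py`, which this guide does not show. A plausible minimal sketch - the fixture's shape is our assumption, inferred from how the tests use it:

# tests/conftest.py - hypothetical sketch of the mock_db fixture.
from unittest.mock import MagicMock

import pytest


@pytest.fixture
def mock_db():
    # Stands in for a SQLAlchemy Session: add/commit calls are recorded,
    # and the query chain returns None by default (no duplicate title found).
    db = MagicMock()
    db.query.return_value.filter.return_value.first.return_value = None
    return db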
Example 2: BAD - AI-Generated Business Logic in an AMBER File
Input prompt in `app/handlers/tasks.py`:
# Prompt: Add a PATCH endpoint to update task status
AI-generated output (rejected in review):
@router.patch("/tasks/{task_id}/status")
def update_task_status(task_id: int, status: str, db: Session = Depends(get_db)):
    # AI-generated - reviewed by bob on 2026-05-17
    task = db.query(Task).filter(Task.id == task_id).first()
    if not task:
        raise HTTPException(status_code=404, detail="Not found")
    task.status = status
    db.commit()
    return task
Why this was rejected: the AI missed three issues.
1. No input validation on `status` - any string is accepted, breaking the status enum contract and allowing invalid state transitions (e.g., a fresh task jumping straight to archived).
2. No authorisation check - any authenticated user can update any task, including tasks owned by other users (a textbook IDOR vulnerability).
3. No audit logging or event emission for status changes, which breaks downstream reporting.
The reviewer rewrote the handler manually with a service-layer call, an enum-validated request schema, and an ownership check. The lesson: AI is fine for the route decorator and response shape (GREEN-level scaffolding), but the conditional logic and security surface must be human.
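A hedged sketch of what that human rewrite could look like - `StatusUpdate`, `TaskStatus`, `get_current_user` and `update_status` are our assumed names, not shown elsewhere in this guide:

from enum import Enum

from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy.orm import Session

from app.dependencies import get_db, get_current_user  # get_current_user is assumed
from app.services.task_service import TaskService

router = APIRouter()


class TaskStatus(str, Enum):
    todo = "todo"
    in_progress = "in_progress"
    done = "done"


class StatusUpdate(BaseModel):
    status: TaskStatus  # enum-validated: an invalid string is rejected with 422


@router.patch("/tasks/{task_id}/status")
def update_task_status(
    task_id: int,
    payload: StatusUpdate,
    db: Session = Depends(get_db),
    user=Depends(get_current_user),
):
    # Ownership check, transition validation and audit logging live in the
    # service layer (RED, human-only); the handler stays thin scaffolding.
    task = TaskService(db).update_status(task_id, payload.status, acting_user=user)
    if task is None:
        raise HTTPException(status_code=404, detail="Task not found")
    return task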
Example 3: RED - What Never Gets Generated
This file stays human-only. No AI tool goes near it.
# HUMAN-ONLY - see AI_GOVERNANCE.md
# app/config.py
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    database_url: str
    jwt_secret: str
    jwt_algorithm: str = "HS256"
    jwt_expiry_hours: int = 24

    class Config:
        env_file = ".env"


settings = Settings()
If an AI tool suggests changing the JWT algorithm or adding a default secret, that suggestion is ignored. The `# HUMAN-ONLY` header reminds every developer who opens the file that this is governed territory.
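One way to make that rule self-enforcing is a guardrail test. A minimal sketch, assuming pydantic v2 (the test file name is hypothetical; note that importing `app.config` also instantiates `Settings()`, so test env values must be available):

# tests/test_governance.py - sketch: fail CI if a secret field ever
# gains a default value, e.g. via an accepted AI suggestion.
from app.config import Settings


def test_secret_settings_have_no_defaults():
    for field in ("database_url", "jwt_secret"):
        assert Settings.model_fields[field].is_required(), (
            f"{field} must not have a default - see AI_GOVERNANCE.md"
        )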
Step 5: Mature the Governance
After the pilot, the rules evolve. Here is how the TaskFlow governance matures over three months.
Month 1: Pilot (2 developers, 1 tool, 1 use case)
- AI Governance Policy v1.0 live in `docs/AI_GOVERNANCE.md`
- GREEN classification covers tests and schemas only
- Daily stand-up includes one AI usage question
- All AI-generated PRs flagged manually by the author
Month 2: Expand (full team, same use case, add documentation)
- GREEN classification expanded to include handler scaffolding (no logic branches)
- Automated check: a linter scans AMBER files for `# AI-generated` comments that are missing a reviewer name (see the sketch after this list)
- Team training session: "How to critique AI output" - a one-hour workshop
- Monthly review: the Engineering Lead checks for RED file violations via IDE logs
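A minimal sketch of that Month 2 comment check, assuming the AMBER directories from the policy above (the script name is our invention):

# check_ai_comments.py - sketch: flag '# AI-generated' comments in AMBER
# files that lack the required reviewer name and date, then fail CI.
import pathlib
import re
import sys

AMBER_DIRS = ["app/handlers", "app/repositories"]
PATTERN = re.compile(r"#\s*AI-generated\s*-\s*reviewed by\s+\S+\s+on\s+\d{4}-\d{2}-\d{2}")

violations = []
for directory in AMBER_DIRS:
    for path in pathlib.Path(directory).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "AI-generated" in line and not PATTERN.search(line):
                violations.append(f"{path}:{lineno}: AI comment missing reviewer name/date")

for violation in violations:
    print(violation)
sys.exit(1 if violations else 0)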
Month 3: Integrate (AI as standard dependency)
- AI Governance Policy v2.0 - refined based on 60 days of data
- CI pipeline includes automated detection: if a RED file changes, verify no AI tool was active in the author's IDE session
- Onboarding doc updated: new developers read the AI governance policy before writing code
- Quarterly review: revisit metrics, adjust classifications, retire tools or approve new ones
The principle: Governance starts as a manual checklist and matures into an automated guardrail. But it always starts manual. You cannot automate what you have not understood.
What Egon Expert Delivers
This guide gives you the framework to do this yourself. If you want it done faster and with lower risk, we deliver the full programme in four stages.
1. Workflow Audit
We spend two days in your codebase mapping your directory structure, current testing practices, and code review standards. We classify every major module into GREEN / AMBER / RED. You get a written AI governance policy tailored to your project, not a generic template.
2. Governance Framework
We write the complete set of copy-paste documents: AI_GOVERNANCE.md, classification rules, PR checklist templates, and incident response procedures. These are checked into your repository and reviewed with your senior engineers before any tool is introduced.
3. Structured Pilot
We design and run a 2-week pilot with 2-3 volunteer developers. We set baselines, define success criteria, and facilitate the daily check-ins and end-of-sprint retrospective. You get a data-driven go / no-go recommendation, not a sales pitch.
4. Team Training
We run a hands-on workshop for your full engineering team: how to prompt effectively, how to critique AI output, and how to maintain the governance framework as your codebase evolves. The goal is self-sufficiency, not dependency on us.
Ready to Introduce AI Tools Properly?
We help engineering teams adopt AI coding tools with the governance and measurement that protects your codebase.