0003. E2E Test Suite Rebuild

Date: 2026-04-20

Author: Samantha Greenham

Stakeholders: Engineering, QA

Status: Proposed

Context

The existing e2eAppium/ suite grew incrementally and has accumulated structural issues that make further incremental improvement slower and riskier than a clean rebuild. An audit identified low confidence in test results, known failures, and coverage gaps across critical journeys.

Significant infrastructure investment has already been made: a debug panel in the QA build, retirement of the E2E-specific build, GitHub Actions self-hosted runner support, artifact capture on failure, and Claude skills for CI diagnosis.

Specific issues found in the audit:

66 test invocations rely on finding a random existing job rather than creating one, so they skip silently instead of failing explicitly when no job is available
A factory pattern refactor was started but not completed; only ~5 tests have been migrated, leaving the remaining 100+ on inconsistent inline setup
Photo upload steps are duplicated across 5 files with near-identical structure
Screen objects mix UI interactions, API calls, and platform logic in single files of 300-400 lines
browser.pause() with magic numbers is used in place of proper element waits
Platform checks (browser.isIOS) scattered throughout tests, screens, and steps with no centralised compatibility layer

Attempting to backfill fixes across 100+ test files, 40+ screen files, and 30+ step files would be high-risk and slow, while QA continues to rely on the existing suite.

Decision

Build a new E2E suite in e2e/, from scratch, running alongside the existing e2eAppium/ suite. The new suite starts empty and grows only with tests that pass on both iOS and Android. QA continues to rely on e2eAppium/ until its coverage has been ported across.

The suite is shaped around a small number of building blocks, each enforced by custom ESLint rules so patterns stay consistent as contributors and AI-generated tests add more files.
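
As an illustration of what one such lint rule might look like, the sketch below flags calls to browser.pause(), one of the patterns this document bans. It is not the suite's actual rule: the name `noBrowserPause` and the message text are invented here, and the `AstNode` and `RuleContext` interfaces are minimal local stand-ins for the types that would normally come from the `eslint` package.

```typescript
// Minimal local stand-ins for ESLint's types, so the sketch needs no imports.
interface AstNode {
  type: string;
  [key: string]: unknown;
}
interface RuleContext {
  report(descriptor: { node: AstNode; messageId: string }): void;
}

// Hypothetical rule: reports any call of the form browser.pause(...).
const noBrowserPause = {
  meta: {
    type: "problem",
    messages: {
      noPause: "browser.pause() is banned; wait on an element or condition instead.",
    },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      // Visitor invoked for every function call in the linted file.
      CallExpression(node: AstNode): void {
        const callee = node.callee as AstNode | undefined;
        // Only plain member calls (a.b()), not computed access (a["b"]()).
        if (!callee || callee.type !== "MemberExpression" || callee.computed === true) return;
        const target = callee.object as AstNode;
        const member = callee.property as AstNode;
        if (
          target.type === "Identifier" && target.name === "browser" &&
          member.type === "Identifier" && member.name === "pause"
        ) {
          context.report({ node, messageId: "noPause" });
        }
      },
    };
  },
};
```

A rule of this shape only catches the direct `browser.pause(...)` form; aliased calls would need extra handling.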

Journey-based tests. One user journey per test file. The it blocks inside a file are ordered checkpoints on shared state rather than independent tests: setup runs once in before, each checkpoint picks up where the previous one left off, and a single failure stops the rest of the file. Block descriptions follow a grammar (<Subject>: when <precondition>, <subject> must <outcome> so <reason>) that concatenates to a readable sentence.
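
The checkpoint model and the description grammar can be sketched without Mocha. In the snippet below, `checkpointTitle` builds a description from the four grammar slots and `runJourney` is a stand-in for ordered it blocks that stop at the first failure. Both names are hypothetical, and the assumption that the whole sentence lives in a single title (rather than being split between describe and it) is ours.

```typescript
// The four slots of the description grammar.
interface CheckpointText {
  subject: string;
  precondition: string;
  outcome: string;
  reason: string;
}

// Builds "<Subject>: when <precondition>, <subject> must <outcome> so <reason>".
function checkpointTitle(t: CheckpointText): string {
  const lowerSubject = t.subject.charAt(0).toLowerCase() + t.subject.slice(1);
  return `${t.subject}: when ${t.precondition}, ${lowerSubject} must ${t.outcome} so ${t.reason}`;
}

type CheckpointOutcome = "passed" | "failed" | "skipped";

interface JourneyCheckpoint<S> {
  title: string;
  run(state: S): void; // throws to signal failure
}

// Runs checkpoints in order on shared state; after the first failure
// the remaining checkpoints are skipped rather than executed.
function runJourney<S>(state: S, checkpoints: JourneyCheckpoint<S>[]): CheckpointOutcome[] {
  let stopped = false;
  return checkpoints.map((checkpoint): CheckpointOutcome => {
    if (stopped) return "skipped";
    try {
      checkpoint.run(state);
      return "passed";
    } catch {
      stopped = true;
      return "failed";
    }
  });
}

const exampleTitle = checkpointTitle({
  subject: "Job list",
  precondition: "a job has been seeded",
  outcome: "show that job",
  reason: "the user can open it",
});
// exampleTitle === "Job list: when a job has been seeded, job list must show that job so the user can open it"
```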

Screen objects in two tiers. Each screen has one class. Its public methods split into macros (business actions like login, submitJob) and primitives (low-level interactions prefixed click/enter/get/is/select/toggle). Tests call macros by preference. Platform branching, waiting, scrolling, and element lookup all live inside screen methods, never in tests.
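
A minimal sketch of the two tiers, assuming a login screen. The `UiActions` interface is a hypothetical slice of the shared actions module, and the locator strings and the iOS-specific submit button are invented purely to show where platform branching would sit.

```typescript
// Hypothetical slice of the shared actions module.
interface UiActions {
  click(locator: string): Promise<void>;
  setText(locator: string, value: string): Promise<void>;
  waitForVisible(locator: string): Promise<void>;
}

class LoginScreen {
  // Locators stay private; tests never see them. Values are illustrative.
  private readonly emailField = "~login-email";
  private readonly passwordField = "~login-password";
  private readonly submitButton: string;
  private readonly actions: UiActions;

  constructor(actions: UiActions, isIOS: boolean) {
    this.actions = actions;
    // Platform branching lives inside the screen, not in the test.
    this.submitButton = isIOS ? "~login-submit-ios" : "~login-submit";
  }

  // Primitives: low-level interactions with a fixed verb prefix.
  async enterEmail(email: string): Promise<void> {
    await this.actions.setText(this.emailField, email);
  }

  async enterPassword(password: string): Promise<void> {
    await this.actions.setText(this.passwordField, password);
  }

  async clickSubmit(): Promise<void> {
    await this.actions.click(this.submitButton);
  }

  // Macro: the business action tests call by preference.
  async login(email: string, password: string): Promise<void> {
    await this.actions.waitForVisible(this.emailField);
    await this.enterEmail(email);
    await this.enterPassword(password);
    await this.clickSubmit();
  }
}
```

A test would then call `await loginScreen.login(email, password)` and contain no waits, locators, or platform checks of its own.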

Journeys for shared setup. Reusable setup flows live in helpers/journeys/. They come in three shapes: API-only (seed data), UI-only (drive through login, navigation), and combined. Tests call journeys from their before hook to reach a known starting state, or from an it block when the journey itself is the action under test.

API-seeded data, always. Every journey creates the data it needs via API. No test looks up existing records. Tests are self-contained and re-runnable in any environment.
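
An API-only journey might look roughly like the sketch below. The `SeedApi` interface, the `createJob` call, and the shape of its response are assumptions made for illustration; the real seeding API is not described in this record. The point is that the journey creates its own job and fails loudly if it cannot, instead of searching for an existing record and skipping.

```typescript
// Hypothetical seeding client; the real API surface may differ.
interface SeedApi {
  createJob(input: { title: string }): Promise<{ id: string }>;
}

interface SeededJob {
  jobId: string;
  title: string;
}

// API-only journey: creates a fresh job for this run and returns its details.
async function seedJob(api: SeedApi, runId: string): Promise<SeededJob> {
  // A per-run title keeps re-runs from colliding with earlier data.
  const title = `e2e-job-${runId}`;
  const created = await api.createJob({ title });
  if (!created.id) {
    // Explicit failure, in contrast to the silent skips found in the audit.
    throw new Error(`Seeding failed: API returned no id for "${title}"`);
  }
  return { jobId: created.id, title };
}
```

A test would call `seedJob` from its before hook and pass the returned `jobId` to later checkpoints.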

No iOS skips, no magic waits. The same test runs on both platforms. Platform differences go inside the screen that knows about them. browser.pause() is banned; waiting goes through a shared helpers module that scrolls into view and retries.
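
The difference between a fixed pause and a condition-based wait can be sketched as a small polling helper. This is a generic stand-in, not the suite's shared helper (which also scrolls into view); the name `waitUntil`, its option names, and the defaults are assumptions. WebdriverIO ships comparable built-ins such as `browser.waitUntil` and `waitForDisplayed`.

```typescript
interface WaitOptions {
  timeoutMs?: number;   // give up after this long
  intervalMs?: number;  // delay between polls
  description?: string; // used in the timeout error message
}

// Polls `condition` until it returns true, or throws once the timeout elapses.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  options: WaitOptions = {},
): Promise<void> {
  const { timeoutMs = 10_000, intervalMs = 250, description = "condition" } = options;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs}ms waiting for ${description}`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike `browser.pause(3000)`, this returns as soon as the condition holds and names what it was waiting for when it fails.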

Strict TypeScript and centralised element access. All UI interaction routes through one actions module (click, setText, waitForVisible, ...) which logs every interaction. Locators are private readonly on screen classes and never exposed to tests.
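
Routing every interaction through one module is what makes uniform logging possible. A rough sketch, assuming a hypothetical `RawDriver` that stands in for the underlying WebdriverIO calls and a caller-supplied log function:

```typescript
// Hypothetical low-level driver; in the real suite this would be WebdriverIO.
interface RawDriver {
  click(locator: string): Promise<void>;
  setValue(locator: string, value: string): Promise<void>;
}

type LogFn = (line: string) => void;

// Wraps the driver so that each interaction is logged before it runs.
function createActions(driver: RawDriver, log: LogFn) {
  return {
    async click(locator: string): Promise<void> {
      log(`click ${locator}`);
      await driver.click(locator);
    },
    async setText(locator: string, value: string): Promise<void> {
      // The typed value is left out of the log so credentials are not recorded.
      log(`setText ${locator}`);
      await driver.setValue(locator, value);
    },
  };
}
```

Because screens receive only the wrapped actions, no interaction can bypass the log.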

Claude-skill driven authoring. Three skills cover the full lifecycle: create a test from a plain English description, run it, debug failures. Skills read short per-topic convention files before generating code, so patterns stay consistent from the first line.

Interactive debugging ("keep-alive"). When a test fails locally, wdio pauses in its after hook and holds the Appium session open. The debug skill attaches to the live session from a throwaway script, drives the app through existing screen classes to see what actually happened, lands a fix in the right file, and re-runs. CI behaviour is unchanged: keep-alive is off by default and activates only via a flag.

CI from day one. Both Android and iOS tests run on GitHub Actions on every PR. BrowserStack configs are pre-wired for manual or nightly runs.

Directory layout

e2e/
├── screens/          # One class per screen (UI only)
│   ├── components/   # Reusable overlays (dialogs, permissions banners)
│   └── utils/        # Screen-only utilities (actions module)
├── tests/            # Mocha specs, one file per journey
├── helpers/
│   ├── api/          # API data seeding
│   └── journeys/     # Setup flows (API-only, UI-only, combined)
├── utils/            # Logger, date helpers, locator builders, ...
├── conventions/      # Per-topic convention docs read by Claude skills
├── eslint-rules/     # Custom lint rules enforcing the patterns above
├── types/            # Shared TypeScript types
├── constants/        # App-wide constants
└── config/           # WDIO + Appium config entry points

Tests from e2eAppium/ port across once they pass under the new patterns. No old test code is imported directly.

Alternatives Considered

Maestro. Evaluated for its simple YAML format and MCP integration, which would support tight AI iteration loops. Ruled out because its YAML format has no support for loops, conditionals, or reusable abstractions, making complex test logic impractical. Data seeding is a core requirement and Maestro has no native TypeScript layer; seeding would require a separate script, resulting in a split architecture that is harder to maintain.

Incremental refactor of e2eAppium/. Would avoid a coverage gap during transition but would require touching 170+ files, with a high risk of breaking tests QA depends on. The structural issues are deep enough that backfilling is slower than rebuilding.

Differences from the Previous Suite

| Area | `e2eAppium/` | `e2e/` |
| --- | --- | --- |
| Test data | Finds random existing jobs | Always creates jobs via API |
| Pattern consistency | Mixed; factories introduced late, most tests predate them | Journey model and two-tier screen classes, lint-enforced from day one |
| TypeScript | Strict config added retroactively | Strict from the start |
| Screen objects | Large files mixing UI, API, and platform logic | One class per screen, macros and primitives, UI only |
| Platform handling | `browser.isIOS` checks scattered throughout, frequent iOS skips | Branching inside screen methods, no iOS skips permitted |
| Test authoring | Manual | Claude skill: describe a journey in plain English |
| Test debugging | Claude skill reads CI artifacts | Claude skill attaches to a paused Appium session locally and iterates |
| Coverage growth | Manual, ad hoc | Structured: port from `e2eAppium/` or add via skill |

Consequences

Easier:

Every test in the suite passes on both platforms, so there are no known failures to work around
New tests can be added by describing a journey in plain English to a Claude skill
Lint rules hold the patterns in place, so a reviewer does not need to memorise them
The existing suite is left untouched, so QA workflows are not at risk during the transition
Failures can be debugged against a live paused session rather than a static artifact

Harder:

Coverage gap during the transition period while tests port across
Two suites to run until e2eAppium/ is retired
Team needs time to get familiar with the new patterns and the skill-driven workflow

Claude Skills

Three skills drive the day-to-day loop:

| Skill | Trigger | Responsibility |
| --- | --- | --- |
| `e2e-create-test` | Plain English description of a journey | Decomposes the journey, finds or creates supporting screens and setup flows, writes the test, then hands off to run and debug |
| `e2e-run-test` | Test file path + platform | Boots emulator/simulator if needed, runs the test with keep-alive enabled, returns structured pass/fail output |
| `e2e-debug-test` | Failed test path + platform | Attaches to the paused Appium session, probes the live app, diagnoses the root cause, applies a minimal fix, re-runs |

The iteration loop is create → run → (on failure) debug → run, repeating up to 3 cycles before surfacing to the engineer. Additional supporting skills cover OTA updates when an app-side change is needed and scaffolding new lint rules when a new convention is added.
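
The control flow of that loop can be sketched as follows. The `SkillSteps` interface and `iterateUntilGreen` are hypothetical names, and the sketch assumes that one cycle means one debug attempt followed by one re-run; the skills themselves are prompts, not TypeScript functions.

```typescript
interface RunResult {
  passed: boolean;
  failure?: string; // summary of what went wrong, when the run failed
}

// Hypothetical stand-ins for the run and debug skills.
interface SkillSteps {
  run(testPath: string): Promise<RunResult>;
  debug(testPath: string, failure: string): Promise<void>;
}

// Runs the test, then alternates debug and re-run until it passes
// or the cycle budget is spent, at which point the engineer takes over.
async function iterateUntilGreen(
  testPath: string,
  steps: SkillSteps,
  maxCycles = 3,
): Promise<{ passed: boolean; cycles: number }> {
  let result = await steps.run(testPath);
  let cycles = 0;
  while (!result.passed && cycles < maxCycles) {
    cycles += 1;
    await steps.debug(testPath, result.failure ?? "unknown failure");
    result = await steps.run(testPath);
  }
  return { passed: result.passed, cycles };
}
```

A result of `{ passed: false, cycles: 3 }` is the signal to surface the failure to the engineer.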
