0003. E2E Test Suite Rebuild
Date: 2026-04-20
Author: Samantha Greenham
Stakeholders: Engineering, QA
Status: Proposed
Context
The existing e2eAppium/ suite has been built up incrementally over time and has accumulated structural issues that make incremental improvement slower and riskier than a clean rebuild. An audit identified low confidence in test results, known failures, and coverage gaps across critical journeys.
Significant infrastructure investment has already been made: a debug panel in the QA build, retirement of the E2E-specific build, GitHub Actions self-hosted runner support, artifact capture on failure, and Claude skills for CI diagnosis.
Specific issues found in the audit:
- 66 test invocations rely on finding a random existing job rather than creating one, so tests skip silently instead of failing explicitly when no job is available
- A factory pattern refactor was started but not completed; only ~5 tests have been migrated, leaving the remaining 100+ on inconsistent inline setup
- Photo upload steps are duplicated across 5 files with near-identical structure
- Screen objects mix UI interactions, API calls, and platform logic in single files of 300-400 lines
- browser.pause() with magic numbers used in place of proper element waits
- Platform checks (browser.isIOS) scattered throughout tests, screens, and steps with no centralised compatibility layer
Attempting to backfill fixes across 100+ test files, 40+ screen files, and 30+ step files would be high-risk and slow, while QA continues to rely on the existing suite.
Decision
Build a new E2E suite in e2e/, from scratch, running alongside the existing e2eAppium/ suite. The new suite starts empty and grows only with tests that pass on both iOS and Android. QA continues to rely on e2eAppium/ until coverage has ported across.
The suite is shaped around a small number of building blocks, each enforced by custom ESLint rules so patterns stay consistent as contributors and AI-generated tests add more files.
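To illustrate what one of these custom rules might look like, the sketch below flags browser.pause() calls. It follows ESLint's standard rule shape (a meta object plus a create function that returns AST visitors). The rule name, the message text, and the loose local AstNode type are assumptions made so the example is self-contained; they are not the suite's actual rule.

```typescript
// Loose local AST type so the sketch stands alone; a real rule would use the
// types that ship with ESLint instead.
interface AstNode {
  type: string;
  name?: string;
  computed?: boolean;
  object?: AstNode;
  property?: AstNode;
  callee?: AstNode;
}

interface RuleContext {
  report(descriptor: { node: AstNode; messageId: string }): void;
}

// Hypothetical rule: reports every `browser.pause(...)` call expression.
const noBrowserPause = {
  meta: {
    type: "problem",
    messages: {
      noPause: "browser.pause() is banned; wait on an element via the shared helpers instead.",
    },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      CallExpression(node: AstNode): void {
        const callee = node.callee;
        const isBrowserPause =
          callee?.type === "MemberExpression" &&
          callee?.computed !== true &&
          callee?.object?.type === "Identifier" &&
          callee?.object?.name === "browser" &&
          callee?.property?.type === "Identifier" &&
          callee?.property?.name === "pause";
        if (isBrowserPause) {
          context.report({ node, messageId: "noPause" });
        }
      },
    };
  },
};
```

A rule of this shape only matches the literal browser.pause form; aliased references (for example a destructured pause) would need extra handling.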
Journey-based tests. One user journey per test file. The it blocks inside a file are ordered checkpoints on shared state rather than independent tests: setup runs once in before, each checkpoint picks up where the previous one left off, and a single failure stops the rest of the file. Block descriptions follow a grammar (<Subject>: when <precondition>, <subject> must <outcome> so <reason>) that concatenates to a readable sentence.
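The description grammar can be made concrete with a small helper. The sketch below builds a title from its four parts and checks an arbitrary title against the grammar; the function names, the Checkpoint type, and the regular expression are hypothetical and only approximate what a lint rule might enforce.

```typescript
// The four slots of the grammar:
// <Subject>: when <precondition>, <subject> must <outcome> so <reason>
interface Checkpoint {
  subject: string;
  precondition: string;
  outcome: string;
  reason: string;
}

function capitalise(text: string): string {
  return text.charAt(0).toUpperCase() + text.slice(1);
}

// Builds the full sentence from its parts.
function checkpointTitle(c: Checkpoint): string {
  return `${capitalise(c.subject)}: when ${c.precondition}, ${c.subject} must ${c.outcome} so ${c.reason}`;
}

// Approximate structural check; it validates the shape, not the wording.
const TITLE_GRAMMAR = /^[^:]+: when .+, .+ must .+ so .+$/;

function followsGrammar(title: string): boolean {
  return TITLE_GRAMMAR.test(title);
}
```

For example, a checkpoint with subject "the job list", precondition "a job has been seeded", outcome "show that job", and reason "the engineer can open it" yields "The job list: when a job has been seeded, the job list must show that job so the engineer can open it".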
Screen objects in two tiers. Each screen has one class. Its public methods split into macros (business actions like login, submitJob) and primitives (low-level interactions prefixed click/enter/get/is/select/toggle). Tests call macros by preference. Platform branching, waiting, scrolling, and element lookup all live inside screen methods, never in tests.
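A minimal sketch of the two tiers, assuming a login screen with three elements. The Actions interface stands in for the shared actions module, and the locator strings are placeholders; neither is taken from the real suite.

```typescript
// Hypothetical slice of the shared actions module; real signatures may differ.
interface Actions {
  click(locator: string): Promise<void>;
  setText(locator: string, value: string): Promise<void>;
  waitForVisible(locator: string): Promise<void>;
}

type Platform = "ios" | "android";

class LoginScreen {
  // Locators are private readonly, so tests cannot reach them.
  private readonly emailField: string;
  private readonly passwordField: string = "login-password";
  private readonly submitButton: string = "login-submit";
  private readonly actions: Actions;

  constructor(actions: Actions, platform: Platform) {
    this.actions = actions;
    // Platform branching lives inside the screen. Both values are placeholders.
    this.emailField = platform === "ios" ? "login-email-ios" : "login-email-android";
  }

  // Primitives: one low-level interaction each, using the prefixed names.
  async enterEmail(email: string): Promise<void> {
    await this.actions.waitForVisible(this.emailField);
    await this.actions.setText(this.emailField, email);
  }

  async enterPassword(password: string): Promise<void> {
    await this.actions.setText(this.passwordField, password);
  }

  async clickSubmit(): Promise<void> {
    await this.actions.click(this.submitButton);
  }

  // Macro: the business action a test would normally call.
  async login(email: string, password: string): Promise<void> {
    await this.enterEmail(email);
    await this.enterPassword(password);
    await this.clickSubmit();
  }
}
```

A test would construct the screen once and call login(...), never touching the locators or the platform check.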
Journeys for shared setup. Reusable setup flows live in helpers/journeys/. They come in three shapes: API-only (seed data), UI-only (drive through login, navigation), and combined. Tests call journeys from their before hook to reach a known starting state, or from an it block when the journey itself is the action under test.
API-seeded data, always. Every journey creates the data it needs via API. No test looks up existing records. Tests are self-contained and re-runnable in any environment.
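The API-only and combined journey shapes might look roughly like this. JobsApi, JobListScreen, the function names, and the title format are all hypothetical; the point is that the journey creates its own record and never looks one up.

```typescript
interface Job {
  id: string;
  title: string;
}

// Hypothetical API client: only a create call is needed, there is no lookup.
interface JobsApi {
  createJob(input: { title: string }): Promise<Job>;
}

// Hypothetical screen macro used by the combined journey.
interface JobListScreen {
  openJob(title: string): Promise<void>;
}

// API-only journey: seeds a fresh job with a run-unique title so the test
// can be re-run in any environment without colliding with earlier data.
async function seedJob(api: JobsApi, runId: string): Promise<Job> {
  return api.createJob({ title: `e2e-job-${runId}` });
}

// Combined journey: seed via API, then drive the UI to the seeded record.
async function seedAndOpenJob(
  api: JobsApi,
  jobList: JobListScreen,
  runId: string,
): Promise<Job> {
  const job = await seedJob(api, runId);
  await jobList.openJob(job.title);
  return job;
}
```

A test's before hook would call one of these and keep the returned Job as the shared state its checkpoints build on.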
No iOS skips, no magic waits. The same test runs on both platforms. Platform differences go inside the screen that knows about them. browser.pause() is banned; waiting goes through a shared helpers module that scrolls into view and retries.
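As a sketch of what replaces a fixed pause, the helper below polls a condition until it holds or a deadline passes. WebdriverIO already provides waitUntil and waitForDisplayed, which a real helper would most likely wrap; this standalone version avoids the driver dependency. The option names and the beforeRetry hook (where a scroll into view could go) are assumptions.

```typescript
interface WaitOptions {
  timeoutMs?: number;
  intervalMs?: number;
  // Hypothetical hook run between attempts, e.g. to scroll the target into view.
  beforeRetry?: () => Promise<void>;
}

// Polls `condition` until it resolves true, or throws once the timeout elapses.
async function waitUntil(
  condition: () => Promise<boolean>,
  description: string,
  options: WaitOptions = {},
): Promise<void> {
  const timeoutMs = options.timeoutMs ?? 10000;
  const intervalMs = options.intervalMs ?? 250;
  const deadline = Date.now() + timeoutMs;

  for (;;) {
    if (await condition()) {
      return;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs}ms waiting for ${description}`);
    }
    if (options.beforeRetry) {
      await options.beforeRetry();
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a pause with a magic number, this returns as soon as the condition holds and fails with a message that names what it was waiting for.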
Strict TypeScript and centralised element access. All UI interaction routes through one actions module (click, setText, waitForVisible, ...) which logs every interaction. Locators are private readonly on screen classes and never exposed to tests.
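One way to get a log line for every interaction is to wrap each driver call in a single logging function. The sketch below assumes a narrow ElementDriver interface in place of the real WebdriverIO calls; the interface, the createActions factory, and the log format are all hypothetical, and waitForVisible is omitted for brevity.

```typescript
// Hypothetical stand-in for the underlying driver calls.
interface ElementDriver {
  click(selector: string): Promise<void>;
  setValue(selector: string, value: string): Promise<void>;
}

type Log = (message: string) => void;

// Runs one interaction and logs its start and its outcome.
async function logged<T>(log: Log, label: string, run: () => Promise<T>): Promise<T> {
  log(`${label}: start`);
  try {
    const result = await run();
    log(`${label}: ok`);
    return result;
  } catch (error) {
    log(`${label}: failed`);
    throw error; // Rethrow so the test still fails at the right step.
  }
}

// Builds the actions that screen classes call; every one goes through `logged`.
function createActions(driver: ElementDriver, log: Log) {
  return {
    click: (selector: string): Promise<void> =>
      logged(log, `click ${selector}`, () => driver.click(selector)),
    setText: (selector: string, value: string): Promise<void> =>
      logged(log, `setText ${selector}`, () => driver.setValue(selector, value)),
  };
}
```

Because screens only ever receive the wrapped actions, no interaction can bypass the log.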
Claude-skill driven authoring. Three skills cover the full lifecycle: create a test from a plain English description, run it, debug failures. Skills read short per-topic convention files before generating code, so patterns stay consistent from the first line.
Interactive debugging ("keep-alive"). When a test fails locally, wdio pauses in its after hook and holds the Appium session open. The debug skill attaches to the live session from a throwaway script, drives the app through existing screen classes to see what actually happened, lands a fix in the right file, and re-runs. CI behaviour is unchanged: keep-alive is off by default and activates only via a flag.
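The keep-alive decision can be sketched as two small functions: one that decides whether to hold the session and one that keeps the hook pending until released. The flag name E2E_KEEP_ALIVE, the function names, and the polling approach are assumptions; the source only states that keep-alive is off by default and enabled via a flag.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical flag name; the source only says a flag enables keep-alive.
const KEEP_ALIVE_FLAG = "E2E_KEEP_ALIVE";

// Off by default: only an explicit "1" turns keep-alive on.
function keepAliveEnabled(env: Env): boolean {
  return env[KEEP_ALIVE_FLAG] === "1";
}

// Hold the session only when the flag is set and something actually failed.
function shouldHoldSession(env: Env, failedTests: number): boolean {
  return keepAliveEnabled(env) && failedTests > 0;
}

// Keeps the caller pending until `released` reports true, polling every `pollMs`.
// How release is signalled (file, signal, prompt) is deliberately left open.
async function holdSession(released: () => boolean, pollMs: number = 500): Promise<void> {
  while (!released()) {
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
}
```

An after hook would call shouldHoldSession with the process environment and the failure count, and await holdSession only when it returns true, so CI runs without the flag finish as usual.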
CI from day one. Both Android and iOS tests run on GitHub Actions on every PR. BrowserStack configs are pre-wired for manual or nightly runs.
Directory layout
e2e/
├── screens/ # One class per screen (UI only)
│ ├── components/ # Reusable overlays (dialogs, permissions banners)
│ └── utils/ # Screen-only utilities (actions module)
├── tests/ # Mocha specs, one file per journey
├── helpers/
│ ├── api/ # API data seeding
│ └── journeys/ # Setup flows (API-only, UI-only, combined)
├── utils/ # Logger, date helpers, locator builders, ...
├── conventions/ # Per-topic convention docs read by Claude skills
├── eslint-rules/ # Custom lint rules enforcing the patterns above
├── types/ # Shared TypeScript types
├── constants/ # App-wide constants
└── config/ # WDIO + Appium config entry points
Tests from e2eAppium/ port across once they pass under the new patterns. No old test code is imported directly.
Alternatives Considered
Maestro. Evaluated for its simple YAML format and MCP integration, which would support tight AI iteration loops. Ruled out because its YAML format has no support for loops, conditionals, or reusable abstractions, making complex test logic impractical. Data seeding is a core requirement and Maestro has no native TypeScript layer; seeding would require a separate script, resulting in a split architecture that is harder to maintain.
Incremental refactor of e2eAppium/. Would avoid a coverage gap during transition but requires touching 170+ files with a high risk of breaking tests QA depends on. The structural issues are deep enough that backfilling is slower than rebuilding.
Differences from the Previous Suite
| Area | e2eAppium/ | e2e/ |
|---|---|---|
| Test data | Finds random existing jobs | Always creates jobs via API |
| Pattern consistency | Mixed; factories introduced late, most tests predate them | Journey model and two-tier screen classes, lint-enforced from day one |
| TypeScript | Strict config added retroactively | Strict from the start |
| Screen objects | Large files mixing UI, API, and platform logic | One class per screen, macros and primitives, UI only |
| Platform handling | browser.isIOS checks scattered throughout, frequent iOS skips | Branching inside screen methods, no iOS skips permitted |
| Test authoring | Manual | Claude skill: describe a journey in plain English |
| Test debugging | Claude skill reads CI artifacts | Claude skill attaches to a paused Appium session locally and iterates |
| Coverage growth | Manual, ad hoc | Structured: port from e2eAppium/ or add via skill |
Consequences
Easier:
- Every test in the suite passes on both platforms, with no known failures to work around
- New tests can be added by describing a journey in plain English to a Claude skill
- Lint rules hold the patterns in place, so a reviewer does not need to memorise them
- No risk of breaking QA workflows during the transition
- Failures can be debugged against a live paused session rather than a static artifact
Harder:
- Coverage gap during the transition period while tests port across
- Two suites to run until e2eAppium/ is retired
- Team needs time to get familiar with the new patterns and the skill-driven workflow
Claude Skills
Three skills drive the day-to-day loop:
| Skill | Trigger | Responsibility |
|---|---|---|
| e2e-create-test | Plain English description of a journey | Decomposes the journey, finds or creates supporting screens and setup flows, writes the test, then hands off to run and debug |
| e2e-run-test | Test file path + platform | Boots emulator/simulator if needed, runs the test with keep-alive enabled, returns structured pass/fail output |
| e2e-debug-test | Failed test path + platform | Attaches to the paused Appium session, probes the live app, diagnoses the root cause, applies a minimal fix, re-runs |
The iteration loop is create → run → (on failure) debug → run, repeating up to 3 cycles before surfacing to the engineer. Additional supporting skills cover OTA updates when an app-side change is needed and scaffolding new lint rules when a new convention is added.
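The bounded loop can be sketched as follows. The function and type names are hypothetical, and the sketch assumes one cycle means one debug step followed by one re-run, which is one reading of "up to 3 cycles".

```typescript
interface RunResult {
  passed: boolean;
  summary: string;
}

interface LoopOutcome {
  passed: boolean;
  debugCycles: number;
  // True when the loop gave up and the engineer needs to step in.
  escalate: boolean;
}

// run → (on failure) debug → run, for at most `maxCycles` debug attempts.
async function iterate(
  run: () => Promise<RunResult>,
  debug: (failure: RunResult) => Promise<void>,
  maxCycles: number = 3,
): Promise<LoopOutcome> {
  let result = await run();
  let debugCycles = 0;

  while (!result.passed && debugCycles < maxCycles) {
    await debug(result);
    result = await run();
    debugCycles += 1;
  }

  return { passed: result.passed, debugCycles, escalate: !result.passed };
}
```

Under this reading, a test that never passes is run four times in total (one initial run plus three re-runs) before the loop surfaces it to the engineer.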