
0003. E2E Test Suite Rebuild

Date: 2026-04-20

Author: Samantha Greenham

Stakeholders: Engineering, QA

Status: Proposed

Context

The existing e2eAppium/ suite grew incrementally and has accumulated structural issues that make further incremental improvement slower and riskier than a clean rebuild. An audit found low confidence in test results, known failures, and coverage gaps across critical journeys.

Significant infrastructure investment has already been made: a debug panel in the QA build, retirement of the E2E-specific build, GitHub Actions self-hosted runner support, artifact capture on failure, and Claude skills for CI diagnosis.

Specific issues found in the audit:

  • 66 test invocations rely on finding a random existing job rather than creating one, so tests skip silently instead of failing explicitly when no job is available
  • A factory pattern refactor was started but not completed; only ~5 tests have been migrated, leaving the remaining 100+ on inconsistent inline setup
  • Photo upload steps are duplicated across 5 files with near-identical structure
  • Screen objects mix UI interactions, API calls, and platform logic in single files of 300-400 lines
  • browser.pause() with magic numbers is used in place of proper element waits
  • Platform checks (browser.isIOS) are scattered throughout tests, screens, and steps with no centralised compatibility layer

Attempting to backfill fixes across 100+ test files, 40+ screen files, and 30+ step files would be high-risk and slow, while QA continues to rely on the existing suite.

Decision

Build a new E2E suite in e2e/, from scratch, running alongside the existing e2eAppium/ suite. The new suite starts empty and grows only with tests that pass on both iOS and Android. QA continues to rely on e2eAppium/ until coverage has ported across.

The suite is shaped around a small number of building blocks, each enforced by custom ESLint rules so patterns stay consistent as contributors and AI-generated tests add more files.
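As an illustration of what one such rule could look like, the sketch below flags browser.pause() calls. The rule name (no-browser-pause) and message text are hypothetical; the meta/create/context.report shape follows ESLint's standard rule format, with minimal local types standing in for ESLint's own so the snippet compiles on its own.

```typescript
// Minimal local stand-ins for ESLint's rule types (assumption: a real rule
// would import these from "eslint" instead of declaring them here).
type AstNode = { type: string; [key: string]: unknown };

interface RuleContext {
  report(descriptor: { node: AstNode; messageId: string }): void;
}

type Visitor = { CallExpression(node: AstNode): void };

interface RuleModule {
  meta: { type: "problem"; messages: Record<string, string>; schema: unknown[] };
  create(context: RuleContext): Visitor;
}

function isIdentifier(node: unknown, expected: string): boolean {
  if (typeof node !== "object" || node === null) return false;
  const candidate = node as AstNode;
  return candidate.type === "Identifier" && candidate.name === expected;
}

// Hypothetical rule: reports browser.pause(...) written with dot access.
const noBrowserPause: RuleModule = {
  meta: {
    type: "problem",
    messages: {
      noPause: "browser.pause() is banned; wait through the shared helpers instead.",
    },
    schema: [],
  },
  create(context) {
    return {
      CallExpression(node) {
        const callee = node.callee as AstNode | undefined;
        if (
          callee !== undefined &&
          callee.type === "MemberExpression" &&
          callee.computed !== true &&
          isIdentifier(callee.object, "browser") &&
          isIdentifier(callee.property, "pause")
        ) {
          context.report({ node, messageId: "noPause" });
        }
      },
    };
  },
};
```

Keeping each convention in its own small rule means a violation is reported at the offending line, whoever or whatever wrote it.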

Journey-based tests. One user journey per test file. The it blocks inside a file are ordered checkpoints on shared state rather than independent tests: setup runs once in before, each checkpoint picks up where the previous one left off, and a single failure stops the rest of the file. Block descriptions follow a grammar (<Subject>: when <precondition>, <subject> must <outcome> so <reason>) that concatenates to a readable sentence.
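The description grammar can be made concrete with a small helper. This is only a sketch: the Checkpoint field names, the helper names, and the example wording are assumptions, not part of the suite.

```typescript
// Parts of one checkpoint description. Field names are illustrative.
interface Checkpoint {
  subject: string; // written in lower case, e.g. "job list"
  precondition: string;
  outcome: string;
  reason: string;
}

// Builds "<Subject>: when <precondition>, <subject> must <outcome> so <reason>".
function checkpointTitle(c: Checkpoint): string {
  const capitalised = c.subject.charAt(0).toUpperCase() + c.subject.slice(1);
  return `${capitalised}: when ${c.precondition}, ${c.subject} must ${c.outcome} so ${c.reason}`;
}

// Loose shape check that a lint rule or review script could apply to titles.
const CHECKPOINT_GRAMMAR = /^[A-Z][^:]*: when .+, .+ must .+ so .+$/;

function followsGrammar(candidate: string): boolean {
  return CHECKPOINT_GRAMMAR.test(candidate);
}

// Hypothetical checkpoint, for illustration only.
const exampleTitle = checkpointTitle({
  subject: "job list",
  precondition: "a job has been seeded via API",
  outcome: "show that job",
  reason: "the engineer can open it",
});
// exampleTitle is:
// "Job list: when a job has been seeded via API, job list must show that job so the engineer can open it"
```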

Screen objects in two tiers. Each screen has one class. Its public methods split into macros (business actions like login, submitJob) and primitives (low-level interactions prefixed click/enter/get/is/select/toggle). Tests call macros by preference. Platform branching, waiting, scrolling, and element lookup all live inside screen methods, never in tests.
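A minimal sketch of the two tiers, assuming a hypothetical LoginScreen, made-up accessibility-id locators, and a cut-down Actions interface standing in for the real actions module:

```typescript
// Hypothetical slice of the shared actions module.
interface Actions {
  click(locator: string): Promise<void>;
  setText(locator: string, value: string): Promise<void>;
  waitForVisible(locator: string): Promise<void>;
}

type Platform = "ios" | "android";

class LoginScreen {
  private readonly actions: Actions;
  // Locators stay private; tests never see them. Values are illustrative.
  private readonly emailField: string;
  private readonly passwordField = "~login-password";
  private readonly submitButton = "~login-submit";

  constructor(actions: Actions, platform: Platform) {
    this.actions = actions;
    // Platform branching lives inside the screen, never in a test.
    this.emailField = platform === "ios" ? "~login-email-field" : "~login-email";
  }

  // Primitives: low-level interactions with a fixed verb prefix.
  async enterEmail(email: string): Promise<void> {
    await this.actions.waitForVisible(this.emailField);
    await this.actions.setText(this.emailField, email);
  }

  async enterPassword(password: string): Promise<void> {
    await this.actions.setText(this.passwordField, password);
  }

  async clickSubmit(): Promise<void> {
    await this.actions.click(this.submitButton);
  }

  // Macro: the business action that tests call by preference.
  async login(email: string, password: string): Promise<void> {
    await this.enterEmail(email);
    await this.enterPassword(password);
    await this.clickSubmit();
  }
}
```

A test would then call only `await loginScreen.login(email, password)` and stay identical on both platforms.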

Journeys for shared setup. Reusable setup flows live in helpers/journeys/. They come in three shapes: API-only (seed data), UI-only (drive through login, navigation), and combined. Tests call journeys from their before hook to reach a known starting state, or from an it block when the journey itself is the action under test.

API-seeded data, always. Any data a test needs is created via API by the journeys it calls. No test looks up existing records, so tests are self-contained and re-runnable in any environment.
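The journey shapes and the seeding rule can be sketched as follows. SeedApi, AppSession, the function names, and the example account are all hypothetical; the point is that the job is created on demand and returned to the caller, never looked up.

```typescript
interface Job {
  id: string;
  title: string;
}

// Hypothetical seeding client and UI driver; the real modules are not shown here.
interface SeedApi {
  createJob(input: { title: string }): Promise<Job>;
}

interface AppSession {
  loginAs(email: string): Promise<void>;
  openJob(jobId: string): Promise<void>;
}

// API-only journey: creates the record this run needs and hands it back.
async function seedJob(api: SeedApi, runId: string): Promise<Job> {
  return api.createJob({ title: `e2e job ${runId}` });
}

// Combined journey: seed via API, then drive the UI to a known starting state.
async function openSeededJob(api: SeedApi, ui: AppSession, runId: string): Promise<Job> {
  const job = await seedJob(api, runId);
  await ui.loginAs("qa@example.com");
  await ui.openJob(job.id);
  return job;
}
```

A test's before hook would call openSeededJob once and keep the returned job for its checkpoints.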

No iOS skips, no magic waits. The same test runs on both platforms. Platform differences go inside the screen that knows about them. browser.pause() is banned; waiting goes through a shared helpers module that scrolls into view and retries.
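A wait helper of the kind described might look like the sketch below. The VisibilityProbe interface, the option names, and the default timings are assumptions; the real helpers module may differ.

```typescript
// Minimal view of an element, so the sketch runs without WebdriverIO.
interface VisibilityProbe {
  isDisplayed(): Promise<boolean>;
  scrollIntoView(): Promise<void>;
}

interface WaitOptions {
  timeoutMs?: number;
  intervalMs?: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Polls until the target is displayed, scrolling between attempts, and
// fails with a descriptive error instead of pausing for a fixed time.
async function waitForVisible(
  target: VisibilityProbe,
  description: string,
  options: WaitOptions = {},
): Promise<void> {
  const timeoutMs = options.timeoutMs ?? 10000; // assumed default
  const intervalMs = options.intervalMs ?? 250; // assumed default
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await target.isDisplayed()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs} ms waiting for ${description}`);
    }
    await target.scrollIntoView();
    await sleep(intervalMs);
  }
}
```

Unlike browser.pause(), this returns as soon as the element appears and names what it was waiting for when it gives up.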

Strict TypeScript and centralised element access. All UI interaction routes through one actions module (click, setText, waitForVisible, ...) which logs every interaction. Locators are private readonly on screen classes and never exposed to tests.
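One way to route every interaction through a single logging module is sketched here. ElementDriver, createActions, and the log format are hypothetical; typed values are deliberately left out of the log so credentials do not leak into CI output.

```typescript
// Hypothetical low-level driver that the actions module wraps.
interface ElementDriver {
  click(locator: string): Promise<void>;
  setValue(locator: string, value: string): Promise<void>;
}

type LogLine = (line: string) => void;

function createActions(driver: ElementDriver, log: LogLine) {
  // Logs the interaction before running it, and logs again if it fails.
  async function logged<T>(action: string, locator: string, run: () => Promise<T>): Promise<T> {
    log(`${action} ${locator}`);
    try {
      return await run();
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      log(`${action} ${locator} failed: ${reason}`);
      throw error;
    }
  }

  return {
    click: (locator: string) => logged("click", locator, () => driver.click(locator)),
    setText: (locator: string, value: string) =>
      logged("setText", locator, () => driver.setValue(locator, value)),
  };
}
```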

Claude-skill-driven authoring. Three skills cover the full lifecycle: create a test from a plain English description, run it, and debug failures. Skills read short per-topic convention files before generating code, so patterns stay consistent from the first line.

Interactive debugging ("keep-alive"). When a test fails locally, wdio pauses in its after hook and holds the Appium session open. The debug skill attaches to the live session from a throwaway script, drives the app through existing screen classes to see what actually happened, lands a fix in the right file, and re-runs. CI behaviour is unchanged: keep-alive is off by default and activates only via a flag.
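The gating logic can be sketched as a hook factory. Everything named here is an assumption: the E2E_KEEP_ALIVE flag, the KeepAliveDeps shape, and the untilReleased mechanism. It relies on the WebdriverIO after hook receiving a numeric result where 0 means the run passed.

```typescript
interface KeepAliveDeps {
  env: Record<string, string | undefined>;
  sessionId: string;
  log: (line: string) => void;
  // Resolves when the engineer or the debug skill releases the session
  // (hypothetical mechanism, e.g. a keypress or a sentinel file).
  untilReleased: () => Promise<void>;
}

// Returns a function suitable for use as the wdio "after" hook.
function createAfterHook(deps: KeepAliveDeps): (result: number) => Promise<void> {
  return async function after(result: number): Promise<void> {
    const enabled = deps.env["E2E_KEEP_ALIVE"] === "1"; // hypothetical flag name
    if (result === 0 || !enabled) return; // passing runs and CI exit as usual
    deps.log(`Keep-alive: holding Appium session ${deps.sessionId}`);
    await deps.untilReleased();
  };
}
```

Because the flag is unset by default, CI runs fall through the early return and finish exactly as before.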

CI from day one. Both Android and iOS tests run on GitHub Actions on every PR. BrowserStack configs are pre-wired for manual or nightly runs.

Directory layout

e2e/
├── screens/          # One class per screen (UI only)
│   ├── components/   # Reusable overlays (dialogs, permissions banners)
│   └── utils/        # Screen-only utilities (actions module)
├── tests/            # Mocha specs, one file per journey
├── helpers/
│   ├── api/          # API data seeding
│   └── journeys/     # Setup flows (API-only, UI-only, combined)
├── utils/            # Logger, date helpers, locator builders, ...
├── conventions/      # Per-topic convention docs read by Claude skills
├── eslint-rules/     # Custom lint rules enforcing the patterns above
├── types/            # Shared TypeScript types
├── constants/        # App-wide constants
└── config/           # WDIO + Appium config entry points

Tests from e2eAppium/ port across once they pass under the new patterns. No old test code is imported directly.

Alternatives Considered

Maestro. Evaluated for its simple YAML format and MCP integration, which would support tight AI iteration loops. Ruled out because its YAML format offers only limited support for loops, conditionals, and reusable abstractions, making complex test logic impractical. Data seeding is a core requirement and Maestro has no native TypeScript layer; seeding would require a separate script, resulting in a split architecture that is harder to maintain.

Incremental refactor of e2eAppium/. Would avoid a coverage gap during the transition but would require touching 170+ files, with a high risk of breaking tests QA depends on. The structural issues are deep enough that backfilling is slower than rebuilding.

Differences from the Previous Suite

| Area | e2eAppium/ | e2e/ |
| --- | --- | --- |
| Test data | Finds random existing jobs | Always creates jobs via API |
| Pattern consistency | Mixed; factories introduced late, most tests predate them | Journey model and two-tier screen classes, lint-enforced from day one |
| TypeScript | Strict config added retroactively | Strict from the start |
| Screen objects | Large files mixing UI, API, and platform logic | One class per screen, macros and primitives, UI only |
| Platform handling | browser.isIOS checks scattered throughout, frequent iOS skips | Branching inside screen methods, no iOS skips permitted |
| Test authoring | Manual | Claude skill: describe a journey in plain English |
| Test debugging | Claude skill reads CI artifacts | Claude skill attaches to a paused Appium session locally and iterates |
| Coverage growth | Manual, ad hoc | Structured: port from e2eAppium/ or add via skill |

Consequences

Easier:

  • Every test in the suite passes on both platforms, with no known failures to work around
  • New tests can be added by describing a journey in plain English to a Claude skill
  • Lint rules hold the patterns in place, so a reviewer does not need to memorise them
  • No risk of breaking QA workflows during the transition
  • Failures can be debugged against a live paused session rather than a static artifact

Harder:

  • Coverage gap during the transition period while tests port across
  • Two suites to run until e2eAppium/ is retired
  • Team needs time to get familiar with the new patterns and the skill-driven workflow

Claude Skills

Three skills drive the day-to-day loop:

| Skill | Trigger | Responsibility |
| --- | --- | --- |
| e2e-create-test | Plain English description of a journey | Decomposes the journey, finds or creates supporting screens and setup flows, writes the test, then hands off to run and debug |
| e2e-run-test | Test file path + platform | Boots emulator/simulator if needed, runs the test with keep-alive enabled, returns structured pass/fail output |
| e2e-debug-test | Failed test path + platform | Attaches to the paused Appium session, probes the live app, diagnoses the root cause, applies a minimal fix, re-runs |

The iteration loop is create → run → (on failure) debug → run, repeating up to 3 cycles before surfacing to the engineer. Additional supporting skills cover OTA updates when an app-side change is needed and scaffolding new lint rules when a new convention is added.
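The loop's control flow can be sketched as below. The function and type names are hypothetical, and the sketch assumes that one cycle means one debug pass followed by one re-run; the skills themselves may count cycles differently.

```typescript
type LoopOutcome =
  | { result: "passed"; debugCycles: number }
  | { result: "needs-engineer"; debugCycles: number };

// runTest resolves to true on pass; debugTest applies a fix attempt.
async function iterateUntilGreen(
  runTest: () => Promise<boolean>,
  debugTest: () => Promise<void>,
  maxCycles: number = 3,
): Promise<LoopOutcome> {
  if (await runTest()) return { result: "passed", debugCycles: 0 };
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    await debugTest();
    if (await runTest()) return { result: "passed", debugCycles: cycle };
  }
  // Still failing after the last cycle: surface to the engineer.
  return { result: "needs-engineer", debugCycles: maxCycles };
}
```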