Incident runbook for no-code apps: detect, triage, recover
Use this incident runbook for no-code apps to spot issues fast, triage impact, roll back safely, communicate clearly, and prevent repeats.

What this runbook is and when to use it
An incident is any unexpected problem that stops people from using your app, makes it painfully slow, or puts data at risk. In no-code apps, that might look like sudden login failures, broken screens after a change, background automations that stop firing, API errors, or "successful" workflows that quietly write the wrong values into the database.
A written runbook turns a stressful moment into a set of small, clear actions. It reduces guesswork, speeds up decisions (like when to roll back), and helps everyone share the same facts. Most delays during incidents aren't technical. They come from uncertainty: Is it real? Who's leading? What changed? What do we tell users?
This playbook is for anyone who touches the app when things go wrong: builders who ship changes, ops or platform owners who manage deployments and access, support teams who hear the first reports, and product or business owners who judge impact and priorities.
It's intentionally lightweight, and it works for teams building on platforms like AppMaster, where you may have visual logic, generated services, and multiple deployment options.
It covers the full incident loop: detect and confirm a real issue, triage fast, stabilize and recover (including rollback decisions), communicate during the outage, then run a short post-incident review so the same problem is less likely to happen again.
It does not cover long-term architecture redesign, deep security forensics, or complex compliance procedures. If you handle regulated data or critical infrastructure, add stricter steps on top of this runbook.
Before anything breaks: set your baseline and roles
Incidents feel chaotic when you don't know what "normal" looks like. Define your baseline so the team can spot real problems quickly. For a no-code app, early signals usually come from a mix of platform health, business metrics, and people.
Write down the signals you'll watch every day, not just during outages. Common ones include uptime, error rate, slow screens, failed logins, payment failures, and spikes in support tickets or user messages.
Define severity in plain language so anyone can use it:
- SEV1: Most users can't use the app, or money/security is at risk.
- SEV2: A key feature is broken, but there's a workaround.
- SEV3: Minor issues, limited users, or cosmetic bugs.
Set response targets that create momentum. Example targets: acknowledge within 5 minutes, post the first update within 15 minutes, and aim to stabilize within 60 minutes (even if the full fix takes longer).
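If it helps, the baseline can live next to your monitoring or deployment scripts as plain data. A minimal sketch, assuming Python; the signal names, severity wording, and numbers are illustrative and should match whatever your team actually agreed on:

```python
# A minimal sketch of the baseline as plain data; names and numbers are
# illustrative and should match what your team actually agreed on.
BASELINE_SIGNALS = ["uptime", "error_rate", "failed_logins", "payment_failures", "support_tickets"]

SEVERITY = {
    "SEV1": "Most users can't use the app, or money/security is at risk.",
    "SEV2": "A key feature is broken, but there's a workaround.",
    "SEV3": "Minor issues, limited users, or cosmetic bugs.",
}

RESPONSE_TARGETS_MINUTES = {
    "acknowledge": 5,    # someone says "I'm on it"
    "first_update": 15,  # first status post to stakeholders
    "stabilize": 60,     # aim to stop the harm, even if the full fix takes longer
}
```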
Decide roles before you need them. Name who can declare an incident, who leads it, and who is backup if that person is offline. On AppMaster teams, that's often the person who owns the Business Process logic, plus a backup who can handle deployments or exports.
Finally, keep one shared place for incident notes. Use timestamps for every action (what changed, when, by whom) so you can reconstruct the story later without guessing.
Detect and confirm: is this real and how bad is it
Confirm impact before you stare at dashboards. Ask one clear question: who can't do what right now? "Support team can't open tickets" is more useful than "the app is slow." If you can, reproduce the problem using the same role and device as the affected user.
Next, work out how wide it is. Is it one account, a customer segment, or everyone? Do quick splits: region, account type, web vs mobile, and a single feature vs the whole app. In no-code tools, something can look global when it's really a permission rule or one broken screen.
Then check what changed. Look back 1-2 hours for a release, a config toggle, a database schema edit, or a data import. On platforms like AppMaster, changes to business processes, data models, or auth settings can affect many flows at once, even if the UI looks fine.
Before you blame your app, rule out external dependencies. Email/SMS providers, payments (like Stripe), and integrations (Telegram, AWS services, AI APIs) can fail or rate-limit. If the app breaks only when sending messages or charging cards, the root problem may be upstream.
Use a simple decision checklist:
- Monitor if impact is low and errors aren't increasing.
- Mitigate now if users are blocked from core tasks or data is at risk.
- Declare an incident if the issue is widespread, time-sensitive, or unclear.
- Escalate if the problem touches payments, authentication, or production data.
- Set a check-in time (for example, every 15 minutes) so the team doesn't drift.
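If your team prefers the checklist as explicit logic, here is a rough sketch of the same rules; the inputs, thresholds, and wording are assumptions to adapt, not a standard:

```python
# Rough sketch of the decision checklist above; inputs and wording are assumptions.
def first_decision(users_blocked: bool, data_at_risk: bool, widespread: bool,
                   errors_increasing: bool, touches_money_or_auth: bool) -> list[str]:
    actions = []
    if users_blocked or data_at_risk:
        actions.append("mitigate now")
    if widespread or data_at_risk:
        actions.append("declare an incident")
    if touches_money_or_auth:
        actions.append("escalate")
    if not actions and not errors_increasing:
        actions.append("monitor")
    actions.append("set a 15-minute check-in")
    return actions

print(first_decision(users_blocked=True, data_at_risk=False, widespread=True,
                     errors_increasing=True, touches_money_or_auth=False))
# ['mitigate now', 'declare an incident', 'set a 15-minute check-in']
```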
Once you classify severity and scope, you can move from "is it real?" to "what do we do first?" without guessing.
Triage step-by-step (first 30 minutes)
Open an incident record immediately. Give it a plain title that names user impact, not the suspected cause (for example, "Checkout failing for EU customers"). Write down the start time (first alert or first report). This becomes the single place for decisions, timestamps, and what changed.
Assign roles so work doesn't overlap. Even in a small team, naming owners reduces mistakes when stress is high. At minimum, you want:
- Incident lead: keeps focus, sets priorities, decides contain vs rollback
- Fixer: investigates and applies changes
- Comms: posts updates to stakeholders and support
- Note taker: logs actions, times, and outcomes
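For the note taker, a minimal sketch of a timestamped action log, assuming you keep it as a small script (or simply mirror the same fields in a shared doc); the field names are illustrative:

```python
# Minimal sketch of a timestamped incident log; field names are illustrative.
from datetime import datetime, timezone

incident_log = []

def log_action(owner: str, action: str, result: str = "") -> None:
    incident_log.append({
        "time_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "owner": owner,
        "action": action,
        "result": result,
    })

log_action("lead", "Declared incident: 'Checkout failing for EU customers'")
log_action("fixer", "Paused order-sync automation", result="error rate still climbing")
```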
State two things in writing: what you know for sure, and your current hypothesis. "Known" might be: error rate spiked, a specific endpoint is failing, only mobile is affected. The hypothesis can be wrong, but it should guide the next test. Keep both updated as you learn.
While things are unstable, set a 15-minute update cadence. If nothing changed, say that. Regular updates stop side discussions and prevent duplicate "any news?" pings.
Choose the first containment action. The goal is to reduce harm fast, even if the root cause isn't clear yet. Typical first moves include pausing background jobs, disabling a risky feature flag, limiting traffic to a module, or switching to a known-safe configuration. In AppMaster, this often means turning off a specific flow in the Business Process Editor or temporarily hiding a UI path that triggers failures.
If containment doesn't improve metrics within one cadence window, start rollback planning in parallel.
Stabilize first: contain the impact
Once you confirm it's a real incident, switch from "finding the bug" to "stopping the bleeding." Stabilizing buys you time. It also protects users, revenue, and data while you investigate.
Start with the smallest change that reduces harm. Containment is often faster than a full fix because you can disable a new feature, pause a workflow, or block a risky input path without a rebuild.
If you suspect data is being corrupted, stop writes first. That can mean temporarily disabling forms, pausing automations that update records, or blocking an API endpoint that accepts updates. Reading bad data is painful, but writing bad data multiplies the cleanup.
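If your setup includes a small custom API in front of the app (not just platform-managed endpoints), a temporary write freeze can be as blunt as rejecting anything that isn't a read. A rough sketch, assuming Flask; the flag and the route are illustrative:

```python
# Hypothetical write-freeze guard for a custom API sitting in front of the app.
# Assumes Flask; the flag and the example route are illustrative only.
from flask import Flask, request, jsonify

app = Flask(__name__)
MAINTENANCE_READ_ONLY = True  # flip this while the incident is open

@app.before_request
def block_writes_during_incident():
    # Allow reads; reject anything that would create or change records.
    if MAINTENANCE_READ_ONLY and request.method in {"POST", "PUT", "PATCH", "DELETE"}:
        return jsonify(error="Writes are temporarily paused during an incident"), 503

@app.route("/api/orders", methods=["GET", "POST"])
def orders():
    return jsonify(status="ok")
```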
If users are locked out, treat login as the top priority. Check authentication settings and the login flow before anything else. Every other fix is slower if users (and your own team) canât access the app.
If the app is slow or timing out, reduce load and remove expensive paths. Turn off heavy screens, pause background jobs, and disable new integrations that spike requests. In AppMaster, containment might be as simple as disabling a problematic business process or temporarily removing a UI action that triggers a costly chain.
Keep actions deliberate and documented. Under pressure, teams repeat steps or undo a fix by accident. Write down each change and the result.
A simple stabilization sequence:
- Stop data writes if corruption is possible, and confirm new records are no longer changing.
- Disable the newest feature flag, automation, or integration involved in the timeline.
- Protect access: restore login and session flow for admins first, then all users.
- Reduce load by pausing batch jobs and removing the slowest user path.
- Log every action with timestamp, owner, and observed effect.
You're aiming for "safe and usable," not "fully solved." Once impact is contained, you can diagnose calmly and choose the right rollback or fix.
Rollback choices and risk checks
When something breaks, speed matters, but the safest move wins. You usually have three practical options: roll back, ship a forward fix, or do a partial revert (turn off one feature while leaving the rest).
First, be clear what "rollback" means in your setup. It might mean deploying the previous app version, reverting a config change, or restoring a database state. On platforms like AppMaster, a "version" can include backend logic, web UI, mobile builds, and environment settings.
Use these risk checks to decide whether rollback is safe:
- Database schema changes: rollback may fail if the old version expects different tables or fields.
- Irreversible data writes: refunds, status changes, or sent messages can't be undone.
- Queued jobs and webhooks: older logic may re-process items or fail on new payloads.
- External dependencies: payment, email/SMS, or Telegram integrations may have changed behavior.
Set a simple go/no-go rule before you touch anything. Pick 2-3 metrics that must improve within 10-15 minutes after the action, such as error rate, login success, checkout completion, or API latency. If they don't move the right way, stop and switch strategy.
Plan the backout of the rollback too. Know how you'll undo it if the older version causes new issues: which build to redeploy, which config to re-apply, and who approves that second change. Keep one person responsible for the final "ship" decision so you don't change course mid-step.
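To make the go/no-go rule concrete, a small sketch that compares the chosen metrics before and after the action; the metric names and thresholds are placeholders to replace with your own:

```python
# Sketch of a go/no-go check: did the chosen metrics move the right way
# within the window? Metric names and thresholds are placeholders.
def rollback_succeeded(before: dict, after: dict) -> bool:
    checks = [
        after["error_rate"] < before["error_rate"] * 0.5,            # errors at least halved
        after["login_success"] >= 0.98,                              # logins back to normal
        after["p95_latency_ms"] <= 1.2 * before["baseline_p95_ms"],  # latency near baseline
    ]
    return all(checks)

before = {"error_rate": 0.12, "login_success": 0.71, "baseline_p95_ms": 400}
after = {"error_rate": 0.02, "login_success": 0.99, "p95_latency_ms": 430}
print(rollback_succeeded(before, after))  # True -> keep the rollback; False -> switch strategy
```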
Communication during the incident
Silence makes incidents worse. Use a simple, repeatable way to keep people informed while the team investigates.
Start with internal updates. Tell the people who will get questions first, and the people who can remove blockers. Keep it short and factual. You typically need:
- Support or customer success: what users are seeing and what to say right now
- Sales or account teams: which accounts are affected and what not to promise
- Builders/engineering: what changed, what's being rolled back, who is on it
- An exec point of contact: impact, risk, next update time
- One owner who approves external wording
For external updates, stick to what you know. Avoid guessing the root cause or blaming a vendor. Users mostly want three things: confirmation, impact, and when youâll update them again.
Simple message templates
Keep one status line consistent across channels:
- Status: Investigating | Identified | Mitigating | Monitoring | Resolved
- Impact: "Some users can't log in" or "Payments fail for new orders"
- Workaround: "Retry in 10 minutes" or "Use the mobile app while web is down" (only if true)
- Next update: "Next update at 14:30 UTC"
If users are angry, acknowledge first, then be specific: "We know checkout is failing for some customers. We are rolling back the last change now. Next update in 30 minutes." Don't promise deadlines, credits, or permanent fixes during the incident.
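To keep the status line identical across channels, a tiny sketch of a message builder; the fields mirror the template above and everything else is illustrative:

```python
# Tiny sketch of a status update builder; fields mirror the template above.
def status_update(status: str, impact: str, next_update: str, workaround: str = "") -> str:
    lines = [f"Status: {status}", f"Impact: {impact}"]
    if workaround:  # only include a workaround if it's actually true
        lines.append(f"Workaround: {workaround}")
    lines.append(f"Next update: {next_update}")
    return "\n".join(lines)

print(status_update("Mitigating", "Payments fail for new orders", "14:30 UTC"))
```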
Resolved vs monitoring
Declare resolved only when the main symptom is gone and key checks are clean (logins, core flows, error rates). Use monitoring when you've applied a fix (for example, rolling back a deployment or restoring a configuration) but you still need time to watch for repeats. Always state what you'll monitor, for how long, and when the final update will be posted.
Diagnose the cause: fast checks that narrow it down
Once things are stable, switch from firefighting to gathering the smallest set of facts that explains the symptoms. The goal isn't a perfect root cause. It's a likely cause you can act on without making the incident worse.
Different symptoms point to different suspects. Slow pages often mean slow database queries, a sudden traffic spike, or an external service lagging. Timeouts can come from a stuck process, an overloaded backend, or an integration that's waiting too long. A spike in errors or retries often tracks back to a recent change, a bad input, or an upstream outage.
Fast checks (15 minutes)
Run one real user journey end to end with a normal test account. This is often the fastest signal because it touches UI, logic, database, and integrations.
Focus on a handful of checks:
- Reproduce one journey: sign in, perform the key action, confirm the result.
- Pinpoint the slow/failing step: page load, API call, database save, webhook.
- Check recent data: scan the last 20-50 records for duplicates, missing fields, or totals that don't add up.
- Validate integrations: recent payment attempts (for example, Stripe), webhook deliveries, and any messaging (email/SMS or Telegram).
- Confirm change context: what was released, configured, or migrated right before the spike?
If you're on AppMaster, this often maps cleanly to a Business Process step, a Data Designer change, or a deployment config change.
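The "one real user journey" check can also live as a small script you run after every release and during incidents. A rough sketch, assuming your app exposes HTTP endpoints a test account can call; the URLs, payloads, and status codes are placeholders:

```python
# Rough smoke test of one user journey: sign in, perform the key action,
# confirm the result. URLs, payloads, and the test account are placeholders.
import sys
import requests

BASE = "https://portal.example.com"

def run_journey() -> bool:
    session = requests.Session()
    login = session.post(f"{BASE}/api/auth/login",
                         json={"email": "smoke-test@example.com", "password": "..."},
                         timeout=10)
    if login.status_code != 200:
        print("FAIL: login", login.status_code)
        return False

    order = session.post(f"{BASE}/api/orders", json={"item": "smoke-test", "qty": 1}, timeout=10)
    if order.status_code != 201:
        print("FAIL: create order", order.status_code)
        return False

    check = session.get(f"{BASE}/api/orders/{order.json()['id']}", timeout=10)
    ok = check.status_code == 200
    print("PASS" if ok else f"FAIL: read back {check.status_code}")
    return ok

if __name__ == "__main__":
    sys.exit(0 if run_journey() else 1)
```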
Decide: keep the mitigation or fix forward
If the quick checks point to a clear culprit, pick the safest move: keep the current mitigation in place, or apply a small permanent fix. Only remove rate limits, feature toggles, or manual workarounds after the journey succeeds twice and the error rate stays flat for a few minutes.
Example scenario: a failed release during business hours
It's 10:15 a.m. on a Tuesday. A team ships a small change to a customer portal built on AppMaster. Within minutes, users start seeing blank pages after login, and new orders stop coming in.
Support notices three tickets with the same message: "Login works, then the portal never loads." At the same time, monitoring shows a spike in 500 errors on the web app and a drop in successful API calls. You treat it as a real incident.
The incident lead does a quick confirmation: try logging in as a test user on desktop and mobile, and check the last deployment time. The timing matches the release, so you assume the latest change is involved until proven otherwise.
The first 30 minutes might look like this:
- Contain: put the portal in maintenance mode (or temporarily disable the affected feature flag) to stop more users from hitting the broken flow.
- Decide rollback: if the failure started right after the release and affects many users, roll back first.
- Communicate: post a short internal update (what's broken, impact, current action, next update time). Send a brief customer message that you're aware and working on it.
- Recover: redeploy the last known good version (or revert the specific module). Retest login, dashboard load, and one core action like "create ticket" or "place order."
- Monitor: watch error rate, login success, and support ticket volume for 10-15 minutes before declaring it stable.
By 10:40 a.m., errors return to normal. You keep an eye on metrics while support confirms new tickets slow down.
Afterward, the team does a short review: what caught this first (alerts vs support), what slowed you down (missing owner, unclear rollback steps), and what to change. A common improvement is adding a release smoke-test checklist for the portal's top three flows and making rollback a documented, one-action step.
Common mistakes that make incidents worse
Most incidents get worse for one of two reasons: people let the system keep doing harm while they investigate, or they change too many things too quickly. This runbook is meant to protect you from both.
A common trap is investigating while the app is still writing bad data. If a workflow is looping, an integration is posting duplicates, or a permission bug is letting the wrong users edit records, pause the offending process first. In AppMaster, that might mean disabling a Business Process, turning off a module integration, or temporarily restricting access so the issue stops spreading.
Another trap is "fixing" by guessing. When several people click around and change settings, you lose the timeline. Even small edits matter during an incident. Agree on one driver, keep a simple change log, and avoid stacking tweaks on top of unknowns.
Mistakes that repeatedly cause longer outages:
- Investigating first and containing later, while bad writes or duplicate actions continue
- Making multiple changes at once without notes, so you can't tell what helped or hurt
- Waiting to communicate, or sending vague updates that create more questions than trust
- Rolling back blindly without checking database state and any queued jobs, emails, or webhooks
- Ending the incident without a clear verification step
Communication is part of recovery. Share what you know, what you don't know, and when the next update will land. "We are rolling back and will confirm billing events are correct within 15 minutes" beats "We're looking into it."
Don't close the incident just because errors stopped. Verify with a short checklist: key screens load, new records save correctly, critical automations run once, and backlogs (queues, retries, scheduled jobs) are drained or safely paused.
Quick checklist you can run under pressure
When things break, your brain will try to do ten tasks at once. Use this to stay calm, keep people safe, and get service back.
Pin this section where your team will actually see it.
- Confirm it's real and scope the impact (5 minutes): Check whether alerts match what users report. Write down what's failing (login, checkout, admin panel), who is affected, and since when. If you can, reproduce in a clean session (incognito or a test account). Take one minute to name an incident owner: one person decides, everyone else supports.
- Stabilize and contain (10 minutes): Stop the bleeding before hunting root cause. Disable the risky path (feature toggle, temporary banner, queue pauses) and test one key journey end to end. Pick the journey that matters most to the business, not the one that's easiest to test.
- Recover service (10-20 minutes): Choose the safest move: rollback to the last known good version or apply a minimal fix. On platforms like AppMaster, that may mean redeploying a previous build or reverting the last change, then confirming error rates and response times return to normal.
- Communicate (throughout): Post a short status update with what's impacted, what users should do, and the next update time. Brief support with a two-sentence script so everyone says the same thing.
- Wrap up cleanly (before you forget): Record what happened, what you changed, and what time service recovered. Assign next steps with an owner and a due date (monitoring tweak, test gap, data cleanup, follow-up fix).
After the incident: learn, fix, and prevent repeats
An incident isn't fully "done" when the app is back up. The fastest way to reduce future downtime is to capture what happened while it's still fresh, then turn that learning into small, real changes.
Schedule a short post-incident review within 2-5 days. Keep it blameless and practical. The goal isn't to find someone to blame. It's to make the next incident easier to handle.
Write a record that someone can read months later: what users saw, when you detected it, what you tried, what worked, and when service returned. Include the root cause if you know it, and note contributing factors like missing alerts, unclear ownership, or confusing rollout steps.
Turn learnings into tasks with owners and due dates. Focus on the smallest changes that prevent the same failure:
- Close monitoring gaps (add one alert or dashboard check that would have caught it earlier)
- Add a guardrail (validation rule, rate limit, feature flag default, approval step)
- Improve tests for the risky area (login, payments, data import, permissions)
- Update the runbook with the exact steps you wish you had
- Do a short training refresh for the on-call or app owners
Pick one prevention measure per incident, even if it's small. "Any change to roles requires a second reviewer" or "Data migrations must run in a staging copy first" can prevent repeat outages.
Keep this runbook next to your build and release process. If you're building with AppMaster, write down where each app is deployed (AppMaster Cloud, AWS, Azure, Google Cloud, or self-hosted), who can redeploy quickly, and who can roll back. If you want a single home for that documentation, keeping it alongside your AppMaster project notes (appmaster.io) makes it easier to find when minutes matter.
FAQ
When should you use this runbook?
Use it anytime an unexpected issue blocks core tasks, makes the app unusably slow, or risks incorrect or unsafe data changes. If users can't log in, payments fail, automations stop, or records are being written incorrectly, treat it as an incident and follow the runbook.
What should you check first when something breaks?
Start with user impact: who can't do what right now, and since when. Then reproduce it with the same role and device, and check whether it's one account, a segment, or everyone so you don't chase the wrong cause.
How do you decide the severity level?
Declare SEV1 when most users are blocked or money/security/data is at risk. Use SEV2 when a key feature is broken but there's a workaround, and SEV3 for minor or limited-scope issues; deciding quickly matters more than being perfect.
What roles does the team need during an incident?
Pick one incident lead who makes the final calls, then assign a fixer, a comms owner, and a note taker so people don't overlap or change things accidentally. If the team is small, one person can hold two roles, but the incident lead role should stay clear.
What does containment look like in a no-code app?
Containment is about stopping harm fast, even if the root cause is still unclear. In AppMaster, that often means disabling a specific Business Process, temporarily hiding a UI action that triggers failures, or pausing an automation that is looping or writing bad data.
When should you roll back instead of fixing forward?
Roll back when the issue started right after a release and you have a known-good version that restores service quickly. Choose a forward fix only when you can make a small, low-risk change and verify it fast without risking more downtime.
When is rollback risky?
Treat rollback as risky if the database schema changed, if irreversible writes happened, or if queued jobs and webhooks might be re-processed by older logic. If any of those are true, stabilize first and confirm what the older version expects before redeploying it.
What should you do if data might be getting corrupted?
Stop writes first if corruption is possible, because bad writes multiply cleanup work. Practically, disable forms, pause update automations, or block update endpoints until you can confirm new records are no longer being changed incorrectly.
How should you communicate during the incident?
Send short, factual updates on a fixed cadence with what's impacted, what you're doing, and when the next update will be. Avoid guessing the cause or blaming vendors; users and stakeholders mainly need clarity and predictable updates.
When is the incident actually resolved?
Consider it resolved only after the main user symptom is gone and key checks are clean, like login, the primary workflow, and error rates. If you fixed something but still need time to watch for repeats, call it monitoring and say what you're watching and for how long.


