Jul 20, 2025 · 8 min read

Incident Runbook for No-Code Apps: Detect, Triage, Recover

This incident runbook helps you detect no-code app problems quickly, triage them, roll back and recover safely, communicate clearly, and prevent repeats.

What this runbook is and when to use it

An incident is any unexpected problem that stops people from using your app, makes it painfully slow, or puts data at risk. In no-code apps, that might look like sudden login failures, broken screens after a change, background automations that stop firing, API errors, or “successful” workflows that quietly write the wrong values into the database.

A written runbook turns a stressful moment into a set of small, clear actions. It reduces guesswork, speeds up decisions (like when to roll back), and helps everyone share the same facts. Most delays during incidents aren’t technical. They come from uncertainty: Is it real? Who’s leading? What changed? What do we tell users?

This playbook is for anyone who touches the app when things go wrong: builders who ship changes, ops or platform owners who manage deployments and access, support teams who hear the first reports, and product or business owners who judge impact and priorities.

It’s intentionally lightweight, including for teams building on platforms like AppMaster where you may have visual logic, generated services, and multiple deployment options.

It covers the full incident loop: detect and confirm a real issue, triage fast, stabilize and recover (including rollback decisions), communicate during the outage, then run a short post-incident review so the same problem is less likely to happen again.

It does not cover long-term architecture redesign, deep security forensics, or complex compliance procedures. If you handle regulated data or critical infrastructure, add stricter steps on top of this runbook.

Before anything breaks: set your baseline and roles

Incidents feel chaotic when you don’t know what “normal” looks like. Define your baseline so the team can spot real problems quickly. For a no-code app, early signals usually come from a mix of platform health, business metrics, and people.

Write down the signals you’ll watch every day, not just during outages. Common ones include uptime, error rate, slow screens, failed logins, payment failures, and spikes in support tickets or user messages.
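
If your platform or monitoring tool can expose those signals over HTTP, even a tiny script can watch them on a schedule. Below is a minimal sketch in Python, assuming a hypothetical /health endpoint and made-up thresholds; the URL, field names, and limits are placeholders, not an AppMaster API.

```python
# Minimal daily signal check -- a sketch, not a monitoring system.
# Assumes a hypothetical /health endpoint that returns JSON like
# {"error_rate": 0.012, "p95_ms": 840, "failed_logins": 3}.
import json
import urllib.request

BASE_URL = "https://app.example.com"          # placeholder app URL
THRESHOLDS = {"error_rate": 0.02, "p95_ms": 2000, "failed_logins": 20}

def check_signals() -> list[str]:
    """Return a human-readable warning for every signal above its baseline."""
    with urllib.request.urlopen(f"{BASE_URL}/health", timeout=10) as resp:
        signals = json.load(resp)
    warnings = []
    for name, limit in THRESHOLDS.items():
        value = signals.get(name)
        if value is not None and value > limit:
            warnings.append(f"{name}={value} exceeds baseline {limit}")
    return warnings

if __name__ == "__main__":
    for line in check_signals():
        print("WARN:", line)
```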

Define severity in plain language so anyone can use it:

  • SEV1: Most users can’t use the app, or money/security is at risk.
  • SEV2: A key feature is broken, but there’s a workaround.
  • SEV3: Minor issues, limited users, or cosmetic bugs.

Set response targets that create momentum. Example targets: acknowledge within 5 minutes, post the first update within 15 minutes, and aim to stabilize within 60 minutes (even if the full fix takes longer).
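
One way to keep those definitions and targets unambiguous is to write them down as data the whole team can see. A minimal sketch, using the example numbers above for SEV1 and made-up placeholders for SEV2 and SEV3:

```python
# Severity levels and response targets as plain data -- a sketch.
# SEV1 mirrors the example targets above; SEV2/SEV3 values are placeholders.
SEVERITIES = {
    "SEV1": {"meaning": "Most users blocked, or money/security at risk",
             "ack_min": 5, "first_update_min": 15, "stabilize_min": 60},
    "SEV2": {"meaning": "Key feature broken, workaround exists",
             "ack_min": 15, "first_update_min": 30, "stabilize_min": 240},
    "SEV3": {"meaning": "Minor, limited, or cosmetic",
             "ack_min": 60, "first_update_min": 240, "stabilize_min": None},
}

def targets_for(severity: str) -> str:
    t = SEVERITIES[severity]
    return (f"{severity}: acknowledge in {t['ack_min']} min, "
            f"first update in {t['first_update_min']} min")

print(targets_for("SEV1"))
```

The exact numbers matter less than having one agreed place where they live.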

Decide roles before you need them. Name who can declare an incident, who leads it, and who is backup if that person is offline. On AppMaster teams, that’s often the person who owns the Business Process logic, plus a backup who can handle deployments or exports.

Finally, keep one shared place for incident notes. Use timestamps for every action (what changed, when, by whom) so you can reconstruct the story later without guessing.

Detect and confirm: is this real and how bad is it

Confirm impact before you stare at dashboards. Ask one clear question: who can’t do what right now? “Support team can’t open tickets” is more useful than “the app is slow.” If you can, reproduce the problem using the same role and device as the affected user.

Next, work out how wide it is. Is it one account, a customer segment, or everyone? Do quick splits: region, account type, web vs mobile, and a single feature vs the whole app. In no-code tools, something can look global when it’s really a permission rule or one broken screen.

Then check what changed. Look back 1-2 hours for a release, a config toggle, a database schema edit, or a data import. On platforms like AppMaster, changes to business processes, data models, or auth settings can affect many flows at once, even if the UI looks fine.

Before you blame your app, rule out external dependencies. Email/SMS providers, payments (like Stripe), and integrations (Telegram, AWS services, AI APIs) can fail or rate-limit. If the app breaks only when sending messages or charging cards, the root problem may be upstream.

Use a simple decision checklist:

  • Monitor if impact is low and errors aren’t increasing.
  • Mitigate now if users are blocked from core tasks or data is at risk.
  • Declare an incident if the issue is widespread, time-sensitive, or unclear.
  • Escalate if the problem touches payments, authentication, or production data.
  • Set a check-in time (for example, every 15 minutes) so the team doesn’t drift.

Once you classify severity and scope, you can move from “is it real?” to “what do we do first?” without guessing.

Triage step-by-step (first 30 minutes)

Open an incident record immediately. Give it a plain title that names user impact, not the suspected cause (for example, “Checkout failing for EU customers”). Write down the start time (first alert or first report). This becomes the single place for decisions, timestamps, and what changed.
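
If you prefer keeping that record as structured data rather than a doc, here is a minimal sketch; the field names are illustrative, and a shared document with the same columns works just as well.

```python
# An incident record as an append-only, timestamped action log -- a sketch.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    title: str                      # user impact, not the suspected cause
    started_at: str                 # first alert or first report
    actions: list = field(default_factory=list)

    def log(self, owner: str, change: str, result: str = "") -> None:
        """Append what changed, when, by whom, and what happened."""
        self.actions.append({
            "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "owner": owner,
            "change": change,
            "result": result,
        })

incident = Incident(title="Checkout failing for EU customers",
                    started_at="2025-07-20T10:15:00Z")
incident.log("dana", "Disabled the discount business process", "error rate dropping")
```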

Assign roles so work doesn’t overlap. Even in a small team, naming owners reduces mistakes when stress is high. At minimum, you want:

  • Incident lead: keeps focus, sets priorities, decides contain vs rollback
  • Fixer: investigates and applies changes
  • Comms: posts updates to stakeholders and support
  • Note taker: logs actions, times, and outcomes

State two things in writing: what you know for sure, and your current hypothesis. “Known” might be: error rate spiked, a specific endpoint is failing, only mobile is affected. The hypothesis can be wrong, but it should guide the next test. Keep both updated as you learn.

While things are unstable, set a 15-minute update cadence. If nothing changed, say that. Regular updates stop side discussions and prevent duplicate “any news?” pings.

Choose the first containment action. The goal is to reduce harm fast, even if the root cause isn’t clear yet. Typical first moves include pausing background jobs, disabling a risky feature flag, limiting traffic to a module, or switching to a known-safe configuration. In AppMaster, this often means turning off a specific flow in the Business Process Editor or temporarily hiding a UI path that triggers failures.

If containment doesn’t improve metrics within one cadence window, start rollback planning in parallel.
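
As a concrete picture of "contain, then re-check within one cadence window," here is a minimal sketch; disable_flag and current_error_rate are placeholders for whatever switch and metric your platform actually exposes (in AppMaster, a Business Process toggle in the visual editor plays that role).

```python
# Contain first, then re-check within one cadence window -- a sketch.
# The two helpers are stand-ins, not a real platform API.
import time

CADENCE_MINUTES = 15

def disable_flag(name: str) -> None:
    print(f"[containment] disabling {name}")      # stand-in for the real switch

def current_error_rate() -> float:
    return 0.05                                   # stand-in for a real metric query

def contain_and_recheck(flag: str, baseline_error_rate: float) -> bool:
    """Return True if containment improved the metric within one cadence window."""
    disable_flag(flag)
    time.sleep(CADENCE_MINUTES * 60)              # wait one cadence window
    improved = current_error_rate() < baseline_error_rate
    if not improved:
        print("No improvement -- start rollback planning in parallel.")
    return improved
```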

Stabilize first: contain the impact

Once you confirm it’s a real incident, switch from “finding the bug” to “stopping the bleeding.” Stabilizing buys you time. It also protects users, revenue, and data while you investigate.

Start with the smallest change that reduces harm. Containment is often faster than a full fix because you can disable a new feature, pause a workflow, or block a risky input path without a rebuild.

If you suspect data is being corrupted, stop writes first. That can mean temporarily disabling forms, pausing automations that update records, or blocking an API endpoint that accepts updates. Reading bad data is painful, but writing bad data multiplies the cleanup.

If users are locked out, treat login as the top priority. Check authentication settings and the login flow before anything else. Every other fix is slower if users (and your own team) can’t access the app.

If the app is slow or timing out, reduce load and remove expensive paths. Turn off heavy screens, pause background jobs, and disable new integrations that spike requests. In AppMaster, containment might be as simple as disabling a problematic business process or temporarily removing a UI action that triggers a costly chain.

Keep actions deliberate and documented. Under pressure, teams repeat steps or undo a fix by accident. Write down each change and the result.

A simple stabilization sequence:

  • Stop data writes if corruption is possible, and confirm new records are no longer changing.
  • Disable the newest feature flag, automation, or integration involved in the timeline.
  • Protect access: restore login and session flow for admins first, then all users.
  • Reduce load by pausing batch jobs and removing the slowest user path.
  • Log every action with timestamp, owner, and observed effect.

You’re aiming for “safe and usable,” not “fully solved.” Once impact is contained, you can diagnose calmly and choose the right rollback or fix.

Rollback choices and risk checks

When something breaks, speed matters, but the safest move wins. You usually have three practical options: roll back, ship a forward fix, or do a partial revert (turn off one feature while leaving the rest).

First, be clear what “rollback” means in your setup. It might mean deploying the previous app version, reverting a config change, or restoring a database state. On platforms like AppMaster, a “version” can include backend logic, web UI, mobile builds, and environment settings.

Use these risk checks to decide whether rollback is safe:

  • Database schema changes: rollback may fail if the old version expects different tables or fields.
  • Irreversible data writes: refunds, status changes, or sent messages can’t be undone.
  • Queued jobs and webhooks: older logic may re-process items or fail on new payloads.
  • External dependencies: payment, email/SMS, or Telegram integrations may have changed behavior.

Set a simple go/no-go rule before you touch anything. Pick 2-3 metrics that must improve within 10-15 minutes after the action, such as error rate, login success, checkout completion, or API latency. If they don’t move the right way, stop and switch strategy.
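
A minimal sketch of that go/no-go rule, assuming you can read the chosen metrics before and after the action; the metric names and values here are placeholders.

```python
# Go/no-go check after a rollback or fix -- a sketch with placeholder metrics.
def go_no_go(before: dict, after: dict, must_improve: list) -> str:
    """Return 'go' if every chosen metric moved the right way, else a no-go reason."""
    for name in must_improve:
        if after[name] >= before[name]:           # lower is better for these metrics
            return f"no-go: {name} did not improve ({before[name]} -> {after[name]})"
    return "go"

before = {"error_rate": 0.08, "login_failures": 120, "p95_ms": 4200}
after  = {"error_rate": 0.01, "login_failures": 9,   "p95_ms": 900}
print(go_no_go(before, after, ["error_rate", "login_failures", "p95_ms"]))
```

If the check comes back as a no-go reason, stop and switch strategy rather than layering more changes.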

Plan the backout of the rollback too. Know how you’ll undo it if the older version causes new issues: which build to redeploy, which config to re-apply, and who approves that second change. Keep one person responsible for the final “ship” decision so you don’t change course mid-step.

Communication during the incident

Silence makes incidents worse. Use a simple, repeatable way to keep people informed while the team investigates.

Start with internal updates. Tell the people who will get questions first, and the people who can remove blockers. Keep it short and factual. You typically need:

  • Support or customer success: what users are seeing and what to say right now
  • Sales or account teams: which accounts are affected and what not to promise
  • Builders/engineering: what changed, what’s being rolled back, who is on it
  • An exec point of contact: impact, risk, next update time
  • One owner who approves external wording

For external updates, stick to what you know. Avoid guessing the root cause or blaming a vendor. Users mostly want three things: confirmation, impact, and when you’ll update them again.

Simple message templates

Keep one status line consistent across channels:

  • Status: Investigating | Identified | Mitigating | Monitoring | Resolved
  • Impact: “Some users can’t log in” or “Payments fail for new orders”
  • Workaround: “Retry in 10 minutes” or “Use the mobile app while web is down” (only if true)
  • Next update: “Next update at 14:30 UTC”

If users are angry, acknowledge first, then be specific: “We know checkout is failing for some customers. We are rolling back the last change now. Next update in 30 minutes.” Don’t promise deadlines, credits, or permanent fixes during the incident.
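
To keep the status line identical across channels, a small helper can assemble it from the same fields every time. A minimal sketch, using the template fields above:

```python
# Compose one consistent status line from the template fields -- a sketch.
STATUSES = ("Investigating", "Identified", "Mitigating", "Monitoring", "Resolved")

def status_update(status: str, impact: str, next_update: str, workaround: str = "") -> str:
    assert status in STATUSES
    parts = [f"Status: {status}", f"Impact: {impact}"]
    if workaround:                                 # include only if it is actually true
        parts.append(f"Workaround: {workaround}")
    parts.append(f"Next update: {next_update}")
    return " | ".join(parts)

print(status_update("Mitigating",
                    "Some users can't log in",
                    "14:30 UTC",
                    "Retry in 10 minutes"))
```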

Resolved vs monitoring

Declare resolved only when the main symptom is gone and key checks are clean (logins, core flows, error rates). Use monitoring when you’ve applied a fix (for example, rolling back a deployment or restoring a configuration) but you still need time to watch for repeats. Always state what you’ll monitor, for how long, and when the final update will be posted.

Diagnose the cause: fast checks that narrow it down

Once things are stable, switch from firefighting to gathering the smallest set of facts that explains the symptoms. The goal isn’t a perfect root cause. It’s a likely cause you can act on without making the incident worse.

Different symptoms point to different suspects. Slow pages often mean slow database queries, a sudden traffic spike, or an external service lagging. Timeouts can come from a stuck process, an overloaded backend, or an integration that’s waiting too long. A spike in errors or retries often tracks back to a recent change, a bad input, or an upstream outage.

Fast checks (15 minutes)

Run one real user journey end to end with a normal test account. This is often the fastest signal because it touches UI, logic, database, and integrations.

Focus on a handful of checks:

  • Reproduce one journey: sign in, perform the key action, confirm the result.
  • Pinpoint the slow/failing step: page load, API call, database save, webhook.
  • Check recent data: scan the last 20-50 records for duplicates, missing fields, or totals that don’t add up.
  • Validate integrations: recent payment attempts (for example, Stripe), webhook deliveries, and any messaging (email/SMS or Telegram).
  • Confirm change context: what was released, configured, or migrated right before the spike?

If you’re on AppMaster, this often maps cleanly to a Business Process step, a Data Designer change, or a deployment config change.
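
Here is a minimal sketch of that one-journey check as a script, assuming hypothetical REST endpoints for login and the key action; replace the URLs, payloads, and field names with your app's real ones (nothing here is a specific AppMaster API).

```python
# One real user journey, end to end -- a sketch against hypothetical endpoints.
import json
import urllib.request

BASE = "https://portal.example.com/api"            # placeholder base URL

def post(path: str, payload: dict, token: str = "") -> dict:
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 **({"Authorization": f"Bearer {token}"} if token else {})},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)

# 1) sign in with a normal test account, 2) perform the key action, 3) confirm the result
token = post("/auth/login", {"email": "test@example.com", "password": "..."}).get("token", "")
order = post("/orders", {"item": "smoke-test", "qty": 1}, token)
assert order.get("status") == "created", f"journey failed at the order step: {order}"
print("journey ok:", order.get("id"))
```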

Decide: keep the mitigation or fix forward

If the quick checks point to a clear culprit, pick the safest move: keep the current mitigation in place, or apply a small permanent fix. Only remove rate limits, feature toggles, or manual workarounds after the journey succeeds twice and the error rate stays flat for a few minutes.

Example scenario: a failed release during business hours

It’s 10:15 a.m. on a Tuesday. A team ships a small change to a customer portal built on AppMaster. Within minutes, users start seeing blank pages after login, and new orders stop coming in.

Support notices three tickets with the same message: “Login works, then the portal never loads.” At the same time, monitoring shows a spike in 500 errors on the web app and a drop in successful API calls. You treat it as a real incident.

The incident lead does a quick confirmation: try logging in as a test user on desktop and mobile, and check the last deployment time. The timing matches the release, so you assume the latest change is involved until proven otherwise.

The first 30 minutes might look like this:

  • Contain: put the portal in maintenance mode (or temporarily disable the affected feature flag) to stop more users from hitting the broken flow.
  • Decide rollback: if the failure started right after the release and affects many users, roll back first.
  • Communicate: post a short internal update (what’s broken, impact, current action, next update time). Send a brief customer message that you’re aware and working on it.
  • Recover: redeploy the last known good version (or revert the specific module). Retest login, dashboard load, and one core action like “create ticket” or “place order.”
  • Monitor: watch error rate, login success, and support ticket volume for 10-15 minutes before declaring it stable.

By 10:40 a.m., errors return to normal. You keep an eye on metrics while support confirms new tickets slow down.

Afterward, the team does a short review: what caught this first (alerts vs support), what slowed you down (missing owner, unclear rollback steps), and what to change. A common improvement is adding a release smoke-test checklist for the portal’s top three flows and making rollback a documented, one-action step.

Common mistakes that make incidents worse

Most incidents get worse for one of two reasons: people let the system keep doing harm while they investigate, or they change too many things too quickly. This runbook is meant to protect you from both.

A common trap is investigating while the app is still writing bad data. If a workflow is looping, an integration is posting duplicates, or a permission bug is letting the wrong users edit records, pause the offending process first. In AppMaster, that might mean disabling a Business Process, turning off a module integration, or temporarily restricting access so the issue stops spreading.

Another trap is “fixing” by guessing. When several people click around and change settings, you lose the timeline. Even small edits matter during an incident. Agree on one driver, keep a simple change log, and avoid stacking tweaks on top of unknowns.

Mistakes that repeatedly cause longer outages:

  • Investigating first and containing later, while bad writes or duplicate actions continue
  • Making multiple changes at once without notes, so you can’t tell what helped or hurt
  • Waiting to communicate, or sending vague updates that create more questions than trust
  • Rolling back blindly without checking database state and any queued jobs, emails, or webhooks
  • Ending the incident without a clear verification step

Communication is part of recovery. Share what you know, what you don’t know, and when the next update will land. “We are rolling back and will confirm billing events are correct within 15 minutes” beats “We’re looking into it.”

Don’t close the incident just because errors stopped. Verify with a short checklist: key screens load, new records save correctly, critical automations run once, and backlogs (queues, retries, scheduled jobs) are drained or safely paused.
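
A minimal sketch of running that verification as a named checklist; each check below is a placeholder you would wire to your real screens, records, and queues.

```python
# Post-incident verification as a short, named checklist -- a sketch.
# Every check here is a placeholder; replace the bodies with real probes.
def key_screens_load() -> bool: return True
def new_records_save() -> bool: return True
def automations_ran_once() -> bool: return True
def backlogs_drained() -> bool: return True

CHECKS = {
    "key screens load": key_screens_load,
    "new records save correctly": new_records_save,
    "critical automations ran exactly once": automations_ran_once,
    "queues/retries drained or safely paused": backlogs_drained,
}

failures = [name for name, check in CHECKS.items() if not check()]
print("verified, safe to close" if not failures else f"do not close yet: {failures}")
```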

Quick checklist you can run under pressure

When things break, your brain will try to do ten tasks at once. Use this to stay calm, keep people safe, and get service back.

Pin this section where your team will actually see it.

  • Confirm it’s real and scope the impact (5 minutes): Check whether alerts match what users report. Write down what’s failing (login, checkout, admin panel), who is affected, and since when. If you can, reproduce in a clean session (incognito or a test account).

Take one minute to name an incident owner. One person decides, everyone else supports.

  • Stabilize and contain (10 minutes): Stop the bleeding before hunting root cause. Disable the risky path (feature toggle, temporary banner, queue pauses) and test one key journey end to end. Pick the journey that matters most to the business, not the one that’s easiest to test.

  • Recover service (10-20 minutes): Choose the safest move: rollback to the last known good version or apply a minimal fix. On platforms like AppMaster, that may mean redeploying a previous build or reverting the last change, then confirming error rates and response times return to normal.

  • Communicate (throughout): Post a short status update with what’s impacted, what users should do, and the next update time. Brief support with a two-sentence script so everyone says the same thing.

  • Wrap up cleanly (before you forget): Record what happened, what you changed, and what time service recovered. Assign next steps with an owner and a due date (monitoring tweak, test gap, data cleanup, follow-up fix).

After the incident: learn, fix, and prevent repeats

An incident isn’t fully “done” when the app is back up. The fastest way to reduce future downtime is to capture what happened while it’s still fresh, then turn that learning into small, real changes.

Schedule a short post-incident review within 2-5 days. Keep it blameless and practical. The goal isn’t to find someone to blame. It’s to make the next incident easier to handle.

Write a record that someone can read months later: what users saw, when you detected it, what you tried, what worked, and when service returned. Include the root cause if you know it, and note contributing factors like missing alerts, unclear ownership, or confusing rollout steps.

Turn learnings into tasks with owners and due dates. Focus on the smallest changes that prevent the same failure:

  • Close monitoring gaps (add one alert or dashboard check that would have caught it earlier)
  • Add a guardrail (validation rule, rate limit, feature flag default, approval step)
  • Improve tests for the risky area (login, payments, data import, permissions)
  • Update the runbook with the exact steps you wish you had
  • Do a short training refresh for the on-call or app owners

Pick one prevention measure per incident, even if it’s small. “Any change to roles requires a second reviewer” or “Data migrations must run in a staging copy first” can prevent repeat outages.

Keep this runbook next to your build and release process. If you’re building with AppMaster, write down where each app is deployed (AppMaster Cloud, AWS, Azure, Google Cloud, or self-hosted), who can redeploy quickly, and who can roll back. If you want a single home for that documentation, keeping it alongside your AppMaster project notes (appmaster.io) makes it easier to find when minutes matter.

FAQ

When should a problem in a no-code app be treated as an incident?

Treat any unexpected problem that stops core tasks, makes the app unusably slow, or risks wrong or unsafe data changes as an incident. If users can't log in, payments fail, automations stop, or records are being written incorrectly, act as if it's an incident and follow the runbook.

How do I quickly confirm an incident is real and gauge its impact?

Look at user impact first: who can't do what right now, and since when. Then reproduce the problem with the same role and device, and check whether it affects a single account, a segment, or everyone, so you don't waste time chasing the wrong cause.

How do I quickly decide between SEV1, SEV2, and SEV3?

Declare SEV1 when most users are blocked or money, security, or data is at risk. SEV2 is when a key feature is broken but a workaround exists; SEV3 covers minor or limited issues. Deciding fast matters more than being perfectly precise.

Who does what during an incident, especially on a small team?

Name one incident lead who makes the final calls, then assign a fixer, a comms owner, and a note taker so people don't overlap or make unintended changes. On a small team one person can hold two roles, but the lead must be clear.

What does "containment" look like on AppMaster or similar no-code platforms?

Containment means stopping the harm fast, before you know the full root cause. In AppMaster that usually means disabling a specific Business Process, temporarily hiding a UI action, or pausing an automation that is looping and writing bad data.

When should I roll back, and when should I ship a forward fix?

Roll back when the problem started right after a release and you have a known-good version that restores service quickly. Ship a forward fix only when you can make a small, low-risk change and verify it fast.

What makes a rollback unsafe in a no-code app?

Rollback is risky when the database schema has changed, irreversible data has been written, or queued jobs and webhooks could be re-processed by the older logic. If any of these apply, stabilize first and confirm what the older version expects.

What should I do first if I suspect data corruption?

Stop writes first, because every bad write multiplies the cleanup later. In practice that can mean disabling forms, pausing automations that update records, or blocking the API endpoint that accepts updates.

What should I say during an outage, and how often should I post updates?

Send short, factual updates on a regular cadence: what's affected, what you're doing, and when the next update will come. Avoid guessing at the root cause or blaming a vendor; users mostly want honesty and a predictable next update time.

When can an incident be called "resolved" versus "monitoring"?

Call it resolved when the main symptom is gone and the key checks are clean (logins, core workflows, error rates). If you've applied a fix but still need to watch for repeats, call it monitoring and say what you'll watch and for how long.
