25 June 2025 · 8 min read

SLOs for internal tools: simple reliability targets that work

Simple SLOs for internal tools: set measurable uptime and latency targets, and tie them to alerts a small team can handle without burning out.


Why internal tools need SLOs (even if only 20 people use them)

Internal tools feel small because the audience is small. The impact often isn't: if your ops dashboard is down, orders pause; if your support console is slow, customers wait; if your admin panel breaks, fixes stack up.

Without clear reliability targets, every outage becomes a debate. One person shrugs at a 10-minute glitch, another treats it like a crisis. You lose time to noisy chats, unclear priorities, and surprise work at the worst moment.

SLOs fix that by setting simple expectations you can measure. They answer two practical questions: what must work, and how well must it work for people to do their jobs.

The hidden cost of "we'll keep it pretty stable" shows up fast. Work stops while teams wait for a tool to recover. Support pings multiply because nobody knows what's normal. Engineers get dragged into urgent fixes instead of planned improvements. Product owners stop trusting the system and start asking for manual backups. Small issues linger because they never cross a clear line.

You don't need a full reliability program. A small team can start with a few user-focused goals like "login works" or "search results load fast," plus a small set of alerts tied to real action.

This applies no matter how the tool is built. If you're using AppMaster (appmaster.io) to create internal apps, pick the actions people rely on, measure uptime and response time, and alert only when it affects work.

SLOs, SLIs, and SLAs in plain words

These three terms sound similar, but they're different kinds of reliability language. Mixing them up is a common source of confusion.

An SLI (Service Level Indicator) is a measurement. It's something you can count, like "percent of requests that succeeded" or "how long the page took to load." If you can't measure it reliably, it's not a good SLI.

An SLO (Service Level Objective) is the goal for that measurement. It answers: what level is good enough for users most of the time? SLOs help you decide what to fix first and what can wait.

An SLA (Service Level Agreement) is a promise, usually written down, often with consequences. Many internal tools don't need SLAs at all. They need clear goals, not legal-style commitments.

A quick example:

  • SLI (uptime): Percentage of minutes the tool is reachable.
  • SLO (uptime goal): 99.9% monthly uptime.
  • SLI (latency): p95 page load time for the dashboard.
  • SLO (latency goal): p95 under 2 seconds during business hours.

Notice what's missing: "never down" or "always fast." SLOs aren't about perfection. They make tradeoffs visible so a small team can choose between features, reliability work, and avoiding unnecessary toil.

A practical rule: if meeting the target would require heroics, it's not an SLO, it's wishful thinking. Start with something your team can maintain calmly, then tighten it later if users still feel pain.
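
To make the split concrete, here is a minimal sketch in Python (the numbers are made up, not real monitoring data) that treats the SLIs from the example above as measurements and checks them against their SLOs:

  # SLIs are measurements; SLOs are the targets they are checked against.
  reachable_minutes = 43_170    # minutes the tool answered a health check this month (illustrative)
  total_minutes = 43_200        # minutes in a 30-day month
  uptime_sli = reachable_minutes / total_minutes   # SLI: what was measured
  uptime_slo = 0.999                               # SLO: 99.9% monthly uptime

  dashboard_p95_seconds = 1.7   # SLI: p95 page load during business hours (illustrative)
  latency_slo_seconds = 2.0     # SLO: p95 under 2 seconds

  print(f"uptime {uptime_sli:.3%}: {'met' if uptime_sli >= uptime_slo else 'missed'}")
  print(f"p95 {dashboard_p95_seconds}s: {'met' if dashboard_p95_seconds <= latency_slo_seconds else 'missed'}")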

Pick the few user actions that really matter

Internal tools fail in specific ways: the admin panel loads but saving a record spins forever; an ops dashboard opens but charts never refresh; a staff portal works except login breaks after an update. You get the most value by focusing on the actions people rely on every day, not every page and button.

Start by naming the tool type, because it hints at the critical paths. Admin panels are about "change something safely." Ops dashboards are about "see what's happening now." Portals are about "get in, find info, and submit a request."

Then write down the top user journeys in plain language. A good starting set:

  • Login and reach the home screen
  • Search or filter and get results
  • Submit a form (create/update) and see a success message
  • Load the main dashboard view with fresh data
  • Export or download the report people use for daily work

For each journey, define what counts as failure. Be strict and measurable: a 500 error is a failure, but so is a timeout, a page that never finishes loading, or a form that returns success while the data is missing.
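
As a sketch of what "strict and measurable" can look like (Python; the field names are hypothetical, not taken from any specific tool), a check for one journey attempt might be:

  # Classify one attempt at a journey (for example, submitting a form) as success or failure.
  def journey_succeeded(status_code: int, elapsed_seconds: float, data_persisted: bool,
                        timeout_seconds: float = 10.0) -> bool:
      if status_code >= 500:                 # server error counts as failure
          return False
      if elapsed_seconds > timeout_seconds:  # timeouts and never-finishing pages count as failure
          return False
      if not data_persisted:                 # a "success" response with missing data is still a failure
          return False
      return True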

Keep the scope small at first. Pick 1 to 3 journeys that match real pain and real risk. If on-call pages are usually "nobody can log in" and "the save button hangs," start with Login and Submit form. Add Search later once you trust the measurements and the alerts.

Choose SLIs you can actually measure

Good SLIs are boring. They come from data you already have, and they match what users feel when the tool works or fails. If you need a whole new monitoring setup just to measure them, pick simpler SLIs.

Start with availability in terms people understand: can I open the tool and can I finish the task? For many internal tools, two SLIs cover most pain:

  • Uptime for the tool (is it reachable and responding)
  • Success rate for 1 to 3 key actions (login, search, save, approve)

Then add latency, but keep it narrow. Choose one or two screens or endpoints that represent the wait users complain about, like loading the dashboard or submitting a form. Measuring everything usually creates noise and arguments.

Decide the measurement window up front. A rolling 30 days is common for steady tools; weekly can work when you release often and want fast feedback. Whatever you choose, stick with it so trends mean something.

Finally, pick one source of truth per SLI and write it down:

  • Synthetic checks (a bot hits a health check or runs a simple flow)
  • Server metrics (request counts, errors, latency from your backend)
  • Logs (count "success" vs "failed" events for a specific action)

Example: if your internal app is built on AppMaster, you can measure uptime with a synthetic ping to the backend, success rate from API responses, and latency from backend request timings. The key is consistency, not perfection.
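
A synthetic check can be as small as a script on a schedule that probes a health endpoint and records success and latency. A minimal sketch (Python with the requests library; the URL is a placeholder, not a real endpoint):

  import time
  import requests

  HEALTH_URL = "https://internal-tool.example.com/health"  # placeholder URL

  def run_check(timeout_seconds: float = 5.0) -> dict:
      # One probe: did the tool answer, and how long did it take?
      started = time.monotonic()
      try:
          response = requests.get(HEALTH_URL, timeout=timeout_seconds)
          ok = response.status_code < 500
      except requests.RequestException:
          ok = False
      return {"ts": time.time(), "ok": ok, "latency_seconds": time.monotonic() - started}

  # Run it every minute from any scheduler and store the results:
  # uptime SLI = share of checks with ok=True, latency SLI = p95 of latency_seconds.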

Set realistic uptime and latency SLOs

Start by picking an uptime number you can defend on a bad week. For many internal tools, 99.5% is a good first SLO. It sounds high, but it leaves room for normal change work. Jumping straight to 99.9% often means after-hours pages and slower releases.

To make uptime feel real, translate it into time. A 30-day month has about 43,200 minutes:

  • 99.5% uptime allows about 216 minutes of downtime per month
  • 99.9% uptime allows about 43 minutes of downtime per month

That allowed downtime is your error budget. If you burn it early, you pause risky changes and focus on reliability until you're back on track.
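
The budget math is simple enough to keep in a tiny helper, for example:

  def error_budget_minutes(slo: float, window_days: int = 30) -> float:
      # Allowed "bad" minutes for an uptime SLO over the window.
      return window_days * 24 * 60 * (1 - slo)

  print(error_budget_minutes(0.995))  # ~216 minutes per 30-day month
  print(error_budget_minutes(0.999))  # ~43 minutes per 30-day month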

For latency, avoid averages. They hide the slow moments users remember. Use a percentile (usually p95) and set a clear threshold tied to a real action. Examples: "p95 page load for the dashboard is under 2 seconds" or "p95 Save completes under 800 ms."

A simple way to set the first number is to watch a week of real traffic, then choose a target close to today's reality rather than a fantasy improvement. If p95 is already 1.9 seconds, a 2.0-second SLO is safe and useful. A 500 ms SLO will just create noise.
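
If you already log request durations, getting that first number takes a few lines. A sketch, assuming a week of latency samples in seconds from your logs (the values here are placeholders):

  def p95(samples: list[float]) -> float:
      # Simple percentile: the value 95% of samples fall at or below.
      ordered = sorted(samples)
      index = min(len(ordered) - 1, int(0.95 * len(ordered)))
      return ordered[index]

  week_of_page_loads = [0.8, 1.1, 1.9, 1.4, 2.3, 1.0, 1.7]  # placeholder data
  print(p95(week_of_page_loads))  # pick the first SLO just above this, like the 1.9 s -> 2.0 s example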

Match SLOs to your support capacity. A small team should prefer a few achievable targets over many strict ones. If nobody can respond within an hour, don't set goals that assume they can.

Make tradeoffs visible: cost, risk, and error budget

A tighter SLO sounds comforting, but it has a price. If you move a tool from 99.5% to 99.9% uptime, you're also saying "we accept far fewer bad minutes," which usually means more paging and more time spent on reliability instead of new features.

The simplest way to make this real is to talk in an error budget. With a 99.5% monthly target, you can "spend" about 3.6 hours of downtime in a 30-day month. With 99.9%, you only get about 43 minutes. That difference changes how often you'll stop feature work to fix reliability.

It also helps to match expectations to when people actually use the tool. A 24/7 target is expensive if the tool is only critical 9am to 6pm. You can set one SLO for business hours (stricter) and a looser one off-hours (fewer pages) so the team can sleep.

Planned maintenance shouldn't count as failure as long as it's communicated and bounded. Treat it as an explicit exception (a maintenance window) rather than ignoring alerts after the fact.

Write down the basics so everyone sees the tradeoffs:

  • The SLO number and what users lose when it's missed
  • The error budget for the month (in minutes or hours)
  • Paging rules (who, when, and for what)
  • Business-hours vs 24/7 expectations, if different
  • What counts as planned maintenance

After 4 to 6 weeks of real data, review the target. If you never burn error budget, the SLO may be too loose. If you burn it quickly and features stall, it's probably too tight.

Map SLOs to alerts your team can maintain

Alerts aren't your SLOs. Alerts are the "something is going wrong right now" signal that protects the SLO. A simple rule: for each SLO, create one alert that matters, and resist adding more unless you can prove they reduce downtime.

A practical approach is to alert on fast SLO burn (how quickly you're using up error budget) or on one clear threshold that matches user pain. If your latency SLO is "p95 under 800 ms," don't page on every slow spike. Page only when it's sustained.

A simple split that keeps noise down:

  • Urgent page: the tool is effectively broken, and someone should act now.
  • Non-urgent ticket: something is degrading, but it can wait until work hours.

Concrete thresholds (adjust to your traffic): if your uptime SLO is 99.5% monthly, page when availability drops below 99% for 10 minutes (clear outage). Create a ticket when it drops below 99.4% over 6 hours (slow burn). For latency, page when p95 is over 1.5 s for 15 minutes; ticket when p95 is over 1.0 s for 2 hours.
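
A sketch of how those thresholds could be evaluated, assuming you can query availability and p95 latency over the relevant windows from whatever metrics source you chose as the source of truth:

  # Turn the example thresholds into a page / ticket / ok decision.
  def availability_signal(avail_last_10m: float, avail_last_6h: float) -> str:
      if avail_last_10m < 0.99:   # clear outage: page now
          return "page"
      if avail_last_6h < 0.994:   # slow burn against the 99.5% SLO: open a ticket
          return "ticket"
      return "ok"

  def latency_signal(p95_last_15m: float, p95_last_2h: float) -> str:
      if p95_last_15m > 1.5:      # sustained, clearly painful slowness: page
          return "page"
      if p95_last_2h > 1.0:       # degrading but can wait for work hours: ticket
          return "ticket"
      return "ok"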

Make ownership explicit. Decide who is on call (even if it's "one person this week"), what acknowledge means (for example, within 10 minutes), and what the first action is. For a small team running an internal app built on AppMaster, that first action might be: check recent deployments, look at API errors, then roll back or redeploy if needed.

After every real alert, do one small follow-up: fix the cause or tune the alert so it pages less often but still catches real user impact.

Common mistakes that create alert fatigue

Alert fatigue usually starts with good intentions. A small team adds "just a few" alerts, then adds one more each week. Soon, people stop trusting notifications, and real outages get missed.

One big trap is alerting on every spike. Internal tools often have bursty traffic (payroll runs, end-of-month reports). If an alert fires on a 2-minute blip, the team learns to ignore it. Tie alerts to user-impact signals, not raw metric noise.

Another trap is thinking "more metrics = safer." More often it means more pages. Stick to a small set of signals users actually feel: login fails, page loads too slowly, key jobs not finishing.

Mistakes that tend to create the most noise:

  • Paging on symptoms (CPU, memory) instead of user impact (errors, latency)
  • No owner for an alert, so it never gets tuned or removed
  • No runbook, so every alert turns into a guessing game
  • Relying on dashboards as a replacement for alerts (dashboards are for looking, alerts are for acting)
  • Making up thresholds because the system is under-instrumented

Dashboards still matter, but they should help you diagnose after an alert fires, not replace the alert.

If you don't have clean measurements yet, don't pretend you do. Add basic instrumentation first (success rate, p95 latency, and a "can a user complete the task" check), then set thresholds based on a week or two of real data.

Quick checks before you turn alerts on

Before you enable alerts, do a short pre-flight. Most alert fatigue comes from skipping one of these basics, then trying to fix it later under pressure.

A practical checklist for a small team:

  • Confirm 1 to 3 key user actions (for example: open the dashboard, save a ticket update, export a report).
  • Keep it to 2 to 4 SLIs you can measure today (availability/success rate, p95 latency, error rate for the critical endpoint).
  • Limit yourself to 2 to 4 alerts total for the tool.
  • Agree on the measurement window, including what "bad" means (last 5 minutes for fast detection, or last 30 to 60 minutes to reduce noise).
  • Assign an owner (one person, not "the team").

Next, make sure the alert can actually be acted on. An alert that fires when nobody is available trains people to ignore it.

Decide these operations details before the first page:

  • Paging hours: business hours only, or true 24/7
  • Escalation path: who is next if the first person doesn't respond
  • What to do first: one or two steps to confirm impact and roll back or mitigate
  • A simple monthly review habit: 15 minutes to look at fired alerts, missed incidents, and whether the SLO still matches how the tool is used

If you build or change the tool (including in AppMaster), rerun the checklist. Regenerated code and new flows can shift latency and error patterns, and your alerts should keep up.

Example: a small ops dashboard with two SLOs and three alerts

An ops team of 18 people uses an internal dashboard all day to check order status, resend failed notifications, and approve refunds. If it's down or slow, work stops fast.

They pick two SLOs:

  • Uptime SLO: 99.9% successful page loads over 30 days (about 43 minutes of "bad time" per month)
  • Latency SLO: p95 page load time under 1.5 seconds during business hours

Now they add three alerts that a small team can handle:

  • Hard down alert (page loads failing): triggers if the success rate drops below 98% for 5 minutes. First action: check recent deploy, restart the web app, confirm database health.
  • Slow dashboard alert: triggers if p95 latency is above 2.5 seconds for 10 minutes. First action: look for a single slow query or a stuck background job, then temporarily pause heavy reports.
  • Error budget burn alert: triggers if they're on track to use 50% of the monthly error budget in the next 7 days. First action: stop non-essential changes until things stabilize.

What matters is what happens next week. If the error budget alert fired twice, the team makes a clear call: delay a new feature and spend two days fixing the biggest latency cause (for example, an unindexed table scan). If they built the tool in AppMaster, they can adjust the data model, regenerate, and redeploy clean code instead of stacking quick fixes.
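
The error budget burn alert in this example is just a projection: if bad minutes keep arriving at the recent rate, will the next 7 days consume half the monthly budget? A small sketch (Python; the numbers are placeholders):

  def burn_alert(budget_minutes: float, bad_minutes_last_7_days: float) -> bool:
      # Assume the recent rate of bad minutes continues for the next 7 days.
      projected_next_7_days = bad_minutes_last_7_days
      return projected_next_7_days >= 0.5 * budget_minutes

  # 99.9% of a 30-day month gives roughly a 43-minute budget.
  print(burn_alert(budget_minutes=43, bad_minutes_last_7_days=25))  # True: pause risky changes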

How to keep SLOs alive without turning it into a project

SLOs only help if they stay connected to real work. The trick is to treat them like a small habit, not a new program.

Use a cadence that fits a small team and attach it to an existing meeting. A quick weekly glance catches drift, and a monthly adjustment is enough once you have real data.

A lightweight process that holds up:

  • Weekly (10 minutes): check the SLO chart and the last few alerts, then confirm nothing is quietly getting worse.
  • After any incident (15 minutes): tag the cause and note which user action was affected (login, search, save, export).
  • Monthly (30 minutes): review the top recurring incident pattern and pick one fix for the next month.
  • Monthly (10 minutes): remove or tune one noisy alert.

Keep improvements small and visible. If "slow page loads every Monday morning" shows up three times, do one concrete change (cache one report, add an index, schedule a heavy job later), then watch the SLI next week.

Use SLOs to say no, politely and clearly. When a request comes in for a low-value feature, point to the current error budget and ask: "Will this change risk our save or approval flow?" If you're already burning budget, reliability wins. That's not blocking, it's prioritizing.

Keep documentation minimal: one page per tool. Include the key user actions, the SLO numbers, the few alerts tied to them, and the owner. If the tool is built on AppMaster, add where you view logs/metrics and who can deploy changes, then stop.

Next steps: start small, then improve one tool at a time

The easiest way to make reliability real is to keep the first setup tiny. Pick one internal tool that causes real pain when it breaks (on-call handoffs, order approvals, refunds, inventory edits), and set targets around the few actions people do every day.

A smallest workable setup most teams can copy:

  • Choose 1 tool and 2 key user actions (for example: Open dashboard and Submit approval).
  • Define 2 SLIs you can measure now: uptime for the endpoint/page, and p95 latency for the action.
  • Set 2 simple SLOs (example: 99.5% uptime monthly, p95 under 800 ms during business hours).
  • Create 2 to 3 alerts total: one for hard down, one for sustained latency, and one for fast error budget burn.
  • Review once a week for 10 minutes: did alerts help, or just make noise?

Once that's stable, expand slowly: add one more action, or one more tool per month. If you can't name who will own an alert, don't create it yet.

If you're building or rebuilding internal tools, AppMaster can make the maintenance side easier to sustain. You can update data models and business logic visually and regenerate clean code as needs shift, which helps keep SLOs aligned with what the tool actually does today.

Try building one internal tool and adding basic SLOs from day one. You'll get clearer expectations, fewer surprises, and alerts your small team can keep up with.

Common questions

Do internal tools need SLOs even if only a small team uses them?

SLOs turn vague statements like "pretty stable" into clear, measurable targets. Even with 20 users, an outage can stop orders, slow down support, or block approvals, so the impact of a small tool can be large.

What should we measure first for an internal admin panel or ops dashboard?

Pick the few user actions people perform every day and that stop work when they fail. Common starting points are login, loading the main dashboard with fresh data, search or filter, and successfully submitting a create/update form.

What is the difference between SLI, SLO, and SLA?

An SLI is the metric you measure (such as success rate or p95 latency). An SLO is the target for that metric (such as 99.5% success over 30 days). An SLA is a formal promise that can carry consequences; most internal tools don't need one.

What is a realistic uptime SLO for a small team?

For many internal tools, a good first uptime SLO is 99.5% monthly, because it is achievable without constant heroics. You can tighten it later, for example for business hours, if that is when the tool is mission-critical.

How do we make uptime percentages understandable?

Translate uptime percentages into minutes so everyone can relate to them. In a 30-day month, 99.5% allows about 216 minutes of downtime; 99.9% allows about 43 minutes, which usually means more paging and more reliability work.

How do we set a latency SLO without adding noise?

Use a percentile (such as p95) instead of averages, because averages hide the slow moments. Tie the target to a real action (for example, "p95 dashboard load under 2 seconds") and choose a threshold you can maintain calmly.

Which SLIs are easy to measure without a big monitoring system?

Start with the server metrics and logs you already have: availability (reachable and responding), success rate for key actions, and p95 latency for one or two important endpoints. Add synthetic checks for the most critical flows only if needed.

How many alerts should we set up for an internal tool?

Default to a small set tied to user impact, and page only on sustained problems. A useful split is one urgent page for "the tool is effectively broken" and one non-urgent ticket for "slow degradation."

What creates alert fatigue with internal tools, and how do we avoid it?

Most alert fatigue comes from paging on every spike or on symptoms like CPU instead of user impact (errors, latency). Keep alerts few, give every alert an owner, and after each real alert either fix the cause or tune the alert so it fires less often but still catches real impact.

How do we apply SLOs if we build our internal tool in AppMaster?

Pick the key actions, then measure availability, success rate, and p95 latency for them using one consistent source. For tools built in AppMaster, keep the targets focused on user actions, and adjust metrics and alerts after big changes or regeneration so they match current behavior.
