SLOs for internal tools: simple reliability targets that work
SLOs for internal tools made simple: set measurable uptime and latency goals, then map them to alerts a small team can maintain without burnout.

Why internal tools need SLOs (even if only 20 people use them)
Internal tools feel small because the audience is small. The impact often isn't: if your ops dashboard is down, orders pause; if your support console is slow, customers wait; if your admin panel breaks, fixes stack up.
Without clear reliability targets, every outage becomes a debate. One person shrugs at a 10-minute glitch, another treats it like a crisis. You lose time to noisy chats, unclear priorities, and surprise work at the worst moment.
SLOs fix that by setting simple expectations you can measure. They answer two practical questions: what must work, and how well must it work for people to do their jobs.
The hidden cost of "we'll keep it pretty stable" shows up fast. Work stops while teams wait for a tool to recover. Support pings multiply because nobody knows what's normal. Engineers get dragged into urgent fixes instead of planned improvements. Product owners stop trusting the system and start asking for manual backups. Small issues linger because they never cross a clear line.
You don't need a full reliability program. A small team can start with a few user-focused goals like "login works" or "search results load fast," plus a small set of alerts tied to real action.
This applies no matter how the tool is built. If you're using AppMaster (appmaster.io) to create internal apps, pick the actions people rely on, measure uptime and response time, and alert only when it affects work.
SLOs, SLIs, and SLAs in plain words
These three terms sound similar, but they're different kinds of reliability language. Mixing them up is a common source of confusion.
An SLI (Service Level Indicator) is a measurement. It's something you can count, like "percent of requests that succeeded" or "how long the page took to load." If you can't measure it reliably, it's not a good SLI.
An SLO (Service Level Objective) is the goal for that measurement. It answers: what level is good enough for users most of the time? SLOs help you decide what to fix first and what can wait.
An SLA (Service Level Agreement) is a promise, usually written down, often with consequences. Many internal tools don't need SLAs at all. They need clear goals, not legal-style commitments.
A quick example:
- SLI (uptime): Percentage of minutes the tool is reachable.
- SLO (uptime goal): 99.9% monthly uptime.
- SLI (latency): p95 page load time for the dashboard.
- SLO (latency goal): p95 under 2 seconds during business hours.
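To make the measurement side concrete, here is a minimal Python sketch that turns raw measurements into those two SLIs and compares them to the example SLOs. The function names and data shapes are illustrative, not tied to any particular monitoring tool.

```python
# Minimal sketch: compute the two example SLIs from raw measurements
# and compare them to the SLO targets above. Data shapes are assumptions.

def uptime_sli(minute_checks: list[bool]) -> float:
    """Percent of minutes the tool was reachable (True = reachable)."""
    return 100.0 * sum(minute_checks) / len(minute_checks)

def p95_latency_sli(load_times_ms: list[float]) -> float:
    """p95 page load time: 95% of loads finished at or under this value."""
    ordered = sorted(load_times_ms)
    index = int(0.95 * (len(ordered) - 1))
    return ordered[index]

# Example inputs: one reachability check per minute, one timing per page load.
checks = [True] * 43170 + [False] * 30        # 30 bad minutes this month
loads = [850, 1200, 640, 1900, 2300, 720]     # sample page loads in ms

print(f"Uptime SLI: {uptime_sli(checks):.3f}%  (SLO: 99.9%)")
print(f"p95 latency SLI: {p95_latency_sli(loads)} ms  (SLO: under 2000 ms)")
```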
Notice what's missing: "never down" or "always fast." SLOs aren't about perfection. They make tradeoffs visible so a small team can choose between features, reliability work, and avoiding unnecessary toil.
A practical rule: if meeting the target would require heroics, it's not an SLO, it's wishful thinking. Start with something your team can maintain calmly, then tighten it later if users still feel pain.
Pick the few user actions that really matter
Internal tools fail in specific ways: the admin panel loads but saving a record spins forever; an ops dashboard opens but charts never refresh; a staff portal works except login breaks after an update. You get the most value by focusing on the actions people rely on every day, not every page and button.
Start by naming the tool type, because it hints at the critical paths. Admin panels are about "change something safely." Ops dashboards are about "see what's happening now." Portals are about "get in, find info, and submit a request."
Then write down the top user journeys in plain language. A good starting set:
- Login and reach the home screen
- Search or filter and get results
- Submit a form (create/update) and see a success message
- Load the main dashboard view with fresh data
- Export or download the report people use for daily work
For each journey, define what counts as failure. Be strict and measurable: a 500 error is a failure, but so is a timeout, a page that never finishes loading, or a form that returns success while the data is missing.
Keep the scope small at first. Pick 1 to 3 journeys that match real pain and real risk. If on-call pages are usually "nobody can log in" and "the save button hangs," start with Login and Submit form. Add Search later once you trust the measurements and the alerts.
Choose SLIs you can actually measure
Good SLIs are boring. They come from data you already have, and they match what users feel when the tool works or fails. If you need a whole new monitoring setup just to measure them, pick simpler SLIs.
Start with availability in terms people understand: can I open the tool and can I finish the task? For many internal tools, two SLIs cover most pain:
- Uptime for the tool (is it reachable and responding)
- Success rate for 1 to 3 key actions (login, search, save, approve)
Then add latency, but keep it narrow. Choose one or two screens or endpoints that represent the wait users complain about, like loading the dashboard or submitting a form. Measuring everything usually creates noise and arguments.
Decide the measurement window up front. A rolling 30 days is common for steady tools; weekly can work when you release often and want fast feedback. Whatever you choose, stick with it so trends mean something.
Finally, pick one source of truth per SLI and write it down:
- Synthetic checks (a bot hits a health check or runs a simple flow)
- Server metrics (request counts, errors, latency from your backend)
- Logs (count "success" vs "failed" events for a specific action)
Example: if your internal app is built on AppMaster, you can measure uptime with a synthetic ping to the backend, success rate from API responses, and latency from backend request timings. The key is consistency, not perfection.
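If you want to see what the synthetic-check option looks like in practice, here is a small Python sketch that hits a health endpoint, records success and latency, and appends the result to a CSV file. The URL, timeout, and log path are placeholders; schedule it with cron or whatever job runner your team already uses.

```python
# Minimal synthetic check sketch: ping a health endpoint, record whether it
# responded in time, and append the result to a local log. The URL, timeout,
# and log path are placeholders; run it from cron or a scheduler every minute.
import csv
import time
import urllib.request
from datetime import datetime, timezone

HEALTH_URL = "https://internal-tool.example.com/health"  # placeholder
TIMEOUT_S = 5
LOG_PATH = "uptime_checks.csv"

def run_check() -> None:
    started = time.monotonic()
    ok = False
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_S) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False  # timeouts and connection errors count as failures
    elapsed_ms = round((time.monotonic() - started) * 1000)
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), int(ok), elapsed_ms]
        )

if __name__ == "__main__":
    run_check()
```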
Set realistic uptime and latency SLOs
Start by picking an uptime number you can defend on a bad week. For many internal tools, 99.5% is a good first SLO. It sounds high, but it leaves room for normal change work. Jumping straight to 99.9% often means after-hours pages and slower releases.
To make uptime feel real, translate it into time. A 30-day month has 43,200 minutes:
- 99.5% uptime allows about 216 minutes of downtime per month
- 99.9% uptime allows about 43 minutes of downtime per month
That allowed downtime is your error budget. If you burn it early, you pause risky changes and focus on reliability until you're back on track.
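Those minutes fall straight out of the SLO percentage. A tiny sketch like this (using the same numbers as above) lets anyone on the team redo the math for a different target:

```python
# Error budget in minutes for a 30-day month: the downtime an SLO allows.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

def error_budget_minutes(slo_percent: float) -> float:
    return MINUTES_PER_30_DAYS * (1 - slo_percent / 100)

for slo in (99.5, 99.9, 99.95):
    print(f"{slo}% uptime -> {error_budget_minutes(slo):.0f} minutes of monthly budget")
# 99.5% -> 216 minutes, 99.9% -> 43 minutes, 99.95% -> 22 minutes
```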
For latency, avoid averages. They hide the slow moments users remember. Use a percentile (usually p95) and set a clear threshold tied to a real action. Examples: "p95 page load for the dashboard is under 2 seconds" or "p95 Save completes under 800 ms."
A simple way to set the first number is to watch a week of real traffic, then choose a target that's slightly better than today but not fantasy. If p95 is already 1.9 seconds, a 2.0-second SLO is safe and useful. A 500 ms SLO will just create noise.
Match SLOs to your support capacity. A small team should prefer a few achievable targets over many strict ones. If nobody can respond within an hour, don't set goals that assume they can.
Make tradeoffs visible: cost, risk, and error budget
A tighter SLO sounds comforting, but it has a price. If you move a tool from 99.5% to 99.9% uptime, you're also saying "we accept far fewer bad minutes," which usually means more paging and more time spent on reliability instead of new features.
The simplest way to make this real is to talk in an error budget. With a 99.5% monthly target, you can "spend" about 3.6 hours of downtime in a 30-day month. With 99.9%, you only get about 43 minutes. That difference changes how often you'll stop feature work to fix reliability.
It also helps to match expectations to when people actually use the tool. A 24/7 target is expensive if the tool is only critical 9am to 6pm. You can set one SLO for business hours (stricter) and a looser one off-hours (fewer pages) so the team can sleep.
Planned maintenance shouldn't count as failure as long as it's communicated and bounded. Treat it as an explicit exception (a maintenance window) rather than ignoring alerts after the fact.
Write down the basics so everyone sees the tradeoffs:
- The SLO number and what users lose when it's missed
- The error budget for the month (in minutes or hours)
- Paging rules (who, when, and for what)
- Business-hours vs 24/7 expectations, if different
- What counts as planned maintenance
After 4 to 6 weeks of real data, review the target. If you never burn error budget, the SLO may be too loose. If you burn it quickly and features stall, it's probably too tight.
Map SLOs to alerts your team can maintain
Alerts aren't your SLOs. Alerts are the "something is going wrong right now" signal that protects the SLO. A simple rule: for each SLO, create one alert that matters, and resist adding more unless you can prove they reduce downtime.
A practical approach is to alert on fast SLO burn (how quickly you're using up error budget) or on one clear threshold that matches user pain. If your latency SLO is "p95 under 800 ms," don't page on every slow spike. Page only when it's sustained.
A simple split that keeps noise down:
- Urgent page: the tool is effectively broken, and someone should act now.
- Non-urgent ticket: something is degrading, but it can wait until work hours.
Concrete thresholds (adjust to your traffic): if your uptime SLO is 99.5% monthly, page when availability drops below 99% for 10 minutes (clear outage). Create a ticket when it drops below 99.4% over 6 hours (slow burn). For latency, page when p95 is over 1.5 s for 15 minutes; ticket when p95 is over 1.0 s for 2 hours.
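To show what "sustained, not spiky" means in code, here is an illustrative Python sketch of the paging logic: it only fires when every minute in a rolling window is below the threshold. In practice you would express this in your monitoring tool's rule language; the function name, thresholds, and sample data are assumptions.

```python
# Sketch of the "sustained, not spiky" rule: page only when availability stays
# below the paging threshold for the whole window. Inputs are per-minute
# availability ratios (0.0-1.0); thresholds mirror the examples above.
from collections import deque

def should_page(per_minute_availability: list[float],
                threshold: float = 0.99,
                window_minutes: int = 10) -> bool:
    """True if every minute in some recent window is below the threshold."""
    window: deque[float] = deque(maxlen=window_minutes)
    for value in per_minute_availability:
        window.append(value)
        if len(window) == window_minutes and max(window) < threshold:
            return True
    return False

# A 2-minute blip should not page; a 10-minute outage should.
blip = [1.0] * 20 + [0.5, 0.6] + [1.0] * 20
outage = [1.0] * 10 + [0.3] * 12
print(should_page(blip))    # False
print(should_page(outage))  # True
```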
Make ownership explicit. Decide who is on call (even if it's "one person this week"), what acknowledge means (for example, within 10 minutes), and what the first action is. For a small team running an internal app built on AppMaster, that first action might be: check recent deployments, look at API errors, then roll back or redeploy if needed.
After every real alert, do one small follow-up: fix the cause or tune the alert so it pages less often but still catches real user impact.
Common mistakes that create alert fatigue
Alert fatigue usually starts with good intentions. A small team adds "just a few" alerts, then adds one more each week. Soon, people stop trusting notifications, and real outages get missed.
One big trap is alerting on every spike. Internal tools often have bursty traffic (payroll runs, end-of-month reports). If an alert fires on a 2-minute blip, the team learns to ignore it. Tie alerts to user-impact signals, not raw metric noise.
Another trap is thinking "more metrics = safer." More often it means more pages. Stick to a small set of signals users actually feel: login fails, page loads too slowly, key jobs not finishing.
Mistakes that tend to create the most noise:
- Paging on symptoms (CPU, memory) instead of user impact (errors, latency)
- No owner for an alert, so it never gets tuned or removed
- No runbook, so every alert turns into a guessing game
- Relying on dashboards as a replacement for alerts (dashboards are for looking, alerts are for acting)
- Making up thresholds because the system is under-instrumented
Dashboards still matter, but they should help you diagnose after an alert fires, not replace the alert.
If you don't have clean measurements yet, don't pretend you do. Add basic instrumentation first (success rate, p95 latency, and a "can a user complete the task" check), then set thresholds based on a week or two of real data.
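If the backend records nothing yet, instrumentation can start very small. Here is an illustrative Python sketch that wraps one key action and records outcome and duration per call; where the record goes (logs, a metrics system, a database table) is up to you, so this version just prints a line. The names are hypothetical.

```python
# Minimal instrumentation sketch: wrap a key action so every call records
# success/failure and duration. Where the record goes (logs, StatsD, a DB)
# is up to you; here it is just printed. Names are illustrative.
import functools
import time

def instrument(action_name: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            outcome = "failure"
            try:
                result = func(*args, **kwargs)
                outcome = "success"
                return result
            finally:
                elapsed_ms = round((time.monotonic() - started) * 1000)
                print(f"metric action={action_name} outcome={outcome} ms={elapsed_ms}")
        return wrapper
    return decorator

@instrument("save_ticket")
def save_ticket(ticket_id: int) -> None:
    time.sleep(0.05)  # stand-in for the real save/update logic

save_ticket(42)  # prints: metric action=save_ticket outcome=success ms=~50
```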
Quick checks before you turn alerts on
Before you enable alerts, do a short pre-flight. Most alert fatigue comes from skipping one of these basics, then trying to fix it later under pressure.
A practical checklist for a small team:
- Confirm 1 to 3 key user actions (for example: open the dashboard, save a ticket update, export a report).
- Keep it to 2 to 4 SLIs you can measure today (availability/success rate, p95 latency, error rate for the critical endpoint).
- Limit yourself to 2 to 4 alerts total for the tool.
- Agree on the measurement window, including what "bad" means (last 5 minutes for fast detection, or last 30 to 60 minutes to reduce noise).
- Assign an owner (one person, not "the team").
Next, make sure the alert can actually be acted on. An alert that fires when nobody is available trains people to ignore it.
Decide these operations details before the first page:
- Paging hours: business hours only, or true 24/7
- Escalation path: who is next if the first person doesn't respond
- What to do first: one or two steps to confirm impact and roll back or mitigate
- A simple monthly review habit: 15 minutes to look at fired alerts, missed incidents, and whether the SLO still matches how the tool is used
If you build or change the tool (including in AppMaster), rerun the checklist. Regenerated code and new flows can shift latency and error patterns, and your alerts should keep up.
Example: a small ops dashboard with two SLOs and three alerts
An ops team of 18 people uses an internal dashboard all day to check order status, resend failed notifications, and approve refunds. If it's down or slow, work stops fast.
They pick two SLOs:
- Uptime SLO: 99.9% successful page loads over 30 days (about 43 minutes of "bad time" per month)
- Latency SLO: p95 page load time under 1.5 seconds during business hours
Now they add three alerts that a small team can handle:
- Hard down alert (page loads failing): triggers if the success rate drops below 98% for 5 minutes. First action: check recent deploy, restart the web app, confirm database health.
- Slow dashboard alert: triggers if p95 latency is above 2.5 seconds for 10 minutes. First action: look for a single slow query or a stuck background job, then temporarily pause heavy reports.
- Error budget burn alert: triggers if they're on track to use 50% of the monthly error budget in the next 7 days; a sketch of that projection follows this list. First action: stop non-essential changes until things stabilize.
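The burn alert is the least obvious of the three, so here is an illustrative Python sketch of the projection behind it: look at recent bad minutes, extend that rate 7 days forward, and warn if the pace would consume half the monthly budget. The lookback window and thresholds are assumptions to adjust to your own tool.

```python
# Sketch of the projection behind the error budget burn alert: look at recent
# "bad minutes", project the rate forward 7 days, and warn if that pace would
# consume 50% of the monthly budget. All numbers are illustrative.
MONTHLY_BUDGET_MIN = 43.2   # 99.9% uptime over a 30-day month
LOOKBACK_DAYS = 3

def burn_alert(bad_minutes_in_lookback: float, horizon_days: float = 7.0) -> bool:
    daily_burn = bad_minutes_in_lookback / LOOKBACK_DAYS
    projected = daily_burn * horizon_days
    return projected >= 0.5 * MONTHLY_BUDGET_MIN

print(burn_alert(3.0))   # False: ~7 projected minutes, well under the 21.6-minute line
print(burn_alert(10.0))  # True: ~23 projected minutes crosses the 50% line
```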
What matters is what happens next week. If the error budget alert fired twice, the team makes a clear call: delay a new feature and spend two days fixing the biggest latency cause (for example, an unindexed table scan). If they built the tool in AppMaster, they can adjust the data model, regenerate, and redeploy clean code instead of stacking quick fixes.
How to keep SLOs alive without turning it into a project
SLOs only help if they stay connected to real work. The trick is to treat them like a small habit, not a new program.
Use a cadence that fits a small team and attach it to an existing meeting. A quick weekly glance catches drift, and a monthly adjustment is enough once you have real data.
A lightweight process that holds up:
- Weekly (10 minutes): check the SLO chart and the last few alerts, then confirm nothing is quietly getting worse.
- After any incident (15 minutes): tag the cause and note which user action was affected (login, search, save, export).
- Monthly (30 minutes): review the top recurring incident pattern and pick one fix for the next month.
- Monthly (10 minutes): remove or tune one noisy alert.
Keep improvements small and visible. If "slow page loads every Monday morning" shows up three times, do one concrete change (cache one report, add an index, schedule a heavy job later), then watch the SLI next week.
Use SLOs to say no, politely and clearly. When a request comes in for a low-value feature, point to the current error budget and ask: "Will this change risk our save or approval flow?" If you're already burning budget, reliability wins. That's not blocking, it's prioritizing.
Keep documentation minimal: one page per tool. Include the key user actions, the SLO numbers, the few alerts tied to them, and the owner. If the tool is built on AppMaster, add where you view logs/metrics and who can deploy changes, then stop.
Next steps: start small, then improve one tool at a time
The easiest way to make reliability real is to keep the first setup tiny. Pick one internal tool that causes real pain when it breaks (on-call handoffs, order approvals, refunds, inventory edits), and set targets around the few actions people do every day.
A smallest workable setup most teams can copy:
- Choose 1 tool and 2 key user actions (for example: Open dashboard and Submit approval).
- Define 2 SLIs you can measure now: uptime for the endpoint/page, and p95 latency for the action.
- Set 2 simple SLOs (example: 99.5% uptime monthly, p95 under 800 ms during business hours).
- Create 2 to 3 alerts total: one for hard down, one for sustained latency, and one for fast error budget burn.
- Review once a week for 10 minutes: did alerts help, or just make noise?
Once that's stable, expand slowly: add one more action, or one more tool per month. If you can't name who will own an alert, don't create it yet.
If you're building or rebuilding internal tools, AppMaster can make the maintenance side easier to sustain. You can update data models and business logic visually and regenerate clean code as needs shift, which helps keep SLOs aligned with what the tool actually does today.
Try building one internal tool and adding basic SLOs from day one. You'll get clearer expectations, fewer surprises, and alerts your small team can keep up with.
FAQ
Why do internal tools with only a handful of users need SLOs?
SLOs stop reliability arguments by turning "pretty stable" into a clear target you can measure. Even with 20 users, an outage can pause orders, slow support, or block approvals, so small tools can still have big impact.
Which user actions should get SLOs first?
Pick a few user actions that people do every day and that block work when they fail. Common starters are login, loading the main dashboard with fresh data, searching/filtering, and submitting a create/update form successfully.
What is the difference between an SLI, an SLO, and an SLA?
An SLI is the metric you measure (like success rate or p95 latency). An SLO is the goal for that metric (like 99.5% success over 30 days). An SLA is a formal promise with consequences, and most internal tools don't need that.
What is a reasonable first uptime SLO?
A good first uptime SLO for many internal tools is 99.5% monthly, because it's achievable without constant heroics. If the tool is truly mission-critical during work hours, you can tighten it later once you've seen real data.
How do I explain the difference between 99.5% and 99.9% uptime?
Translate the uptime percent into minutes so everyone understands the tradeoff. In a 30-day month, 99.5% allows about 216 minutes of downtime, while 99.9% allows about 43 minutes, which often means more paging and more reliability work.
How should I set a latency target?
Use a percentile like p95, not an average, because averages hide the slow moments users feel. Set the target on a real action (like "p95 dashboard load under 2s during business hours") and choose a threshold you can maintain calmly.
Where should the measurements come from?
Start with server metrics and logs you already have: availability (reachable and responding), success rate for key actions, and p95 latency for one or two critical endpoints or screens. Add synthetic checks only for the most important flows so measurement stays consistent and simple.
How many alerts should a small team run?
Default to a small set of alerts tied to user impact, and page only on sustained problems. A useful split is one urgent page for "tool is effectively broken" and one non-urgent ticket for "slow burn" degradation you can handle during work hours.
How do we avoid alert fatigue?
Most alert fatigue comes from paging on every spike or on symptoms like CPU instead of user impact like errors and latency. Keep alerts few, give each one an owner, and after every real alert either fix the cause or tune the alert so it fires less often but still catches real issues.
How does this apply to apps built in AppMaster?
Pick the key actions in your app, then measure uptime, success rate, and p95 latency for those actions using a consistent source of truth. If you build internal tools in AppMaster, keep the targets focused on what users do (login, save, search), and adjust measurements and alerts after major changes or regenerations so they match current behavior.


