DISPATCH_008 · ENGINEERING · Filed Apr 9, 2026 · Operator: Priya Sharma · 11 min read

Building a health score that actually means something

Behind the 0–100 number on your monitoring dashboard: why we weighted critical issues 15x, how we avoid alert fatigue from info-level noise, and what we learned tuning the formula against real customer sites.

Every monitoring product has a number. A single 0-to-100 score that's supposed to tell you, at a glance, whether your thing is healthy. These numbers are almost always wrong. Not wrong in the sense of buggy — wrong in the sense of "designed to feel good rather than to tell you something."

When we started building Site Health Monitoring, I knew we were going to have a score. It's table stakes. Customers will not accept "here are 47 issues, good luck" as a dashboard — they need a summary. But I also knew that the first version of the score we built would probably be bad, because every simple weighted sum gets gamed by edge cases within a week of launch.

This post is the story of building the score, breaking it, rebuilding it, and the specific trade-offs that got us to the formula we shipped. It's a longer post than I expected to write. If you're building monitoring tooling of your own, I hope some of this is useful.

What we actually ship

Let me start with the punchline. The current VectraSEO health score formula is:

score = max(0, 100 − 15 × critical − 5 × warning − 1 × info)

You start at 100. Each critical issue subtracts 15 points. Each warning subtracts 5. Each info subtracts 1. The score is clamped at zero — it doesn't go negative. That's it.

If you read that formula and thought "surely it's more sophisticated than that," no. It is deliberately not. Every piece of sophistication we tried to add made the score worse, not better, and I want to walk through why.
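In code, the whole thing fits in a screenshot. A minimal sketch of the formula as a function (not the production implementation, which pulls the counts out of scan results):

def health_score(critical: int, warning: int, info: int) -> int:
    """Weighted-sum health score, clamped at zero: 15x / 5x / 1x weights."""
    return max(0, 100 - 15 * critical - 5 * warning - 1 * info)

# One critical, two warnings, a dozen info issues:
# 100 - 15 - 10 - 12 = 63
print(health_score(critical=1, warning=2, info=12))  # 63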

The first version (which was bad)

Our first attempt was a percentage-based score. We computed the percentage of URLs in the sitemap that had issues, inverted it, and scaled it to 0–100. A site with 0 issues out of 200 URLs scored 100. A site with 100 issues out of 200 URLs scored 50. Simple, intuitive.
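For reference, that first version looked roughly like this. This is a sketch of the shape of the idea, not the code we actually ran:

def health_score_v1(urls_with_issues: int, total_urls: int) -> float:
    """Rejected v1: percent of clean URLs in the sitemap, scaled to 0-100."""
    if total_urls == 0:
        return 100.0
    return 100.0 * (1 - urls_with_issues / total_urls)

print(health_score_v1(0, 200))    # 100.0
print(health_score_v1(100, 200))  # 50.0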

It was useless. Here's why.

First, the score was dominated by info-level issues. A site might have 150 URLs with minor meta-description length problems — info-level issues that, in isolation, are barely worth fixing — and its score would drop to 25. Meanwhile, a site with three critical noindex-in-sitemap issues and no info issues would score 98. The first site was basically fine. The second site was on fire. The score was telling us the opposite of what it should.

Second, the percentage framing scaled badly. A site with 10 URLs that had 5 issues scored the same as a site with 10,000 URLs that had 5,000 issues, despite being at completely different scales of problem. And a site with 200 URLs and 1 issue scored 99.5%, which rounded to 100 and felt dishonest.

Third, the score was volatile in a bad way. Adding ten new URLs to your sitemap would change your denominator, which would change your score even if no issues had changed. The score was responding to sitemap size, not site health.

So we scrapped the percentage approach and started over.

The weighted-sum approach

What we tried next was the weighted sum that we eventually shipped. Start at 100, subtract weighted points per issue, floor at zero. The hard question became: what are the weights?

Our first pass was critical × 10, warning × 3, info × 1. I picked those numbers in about ten minutes based on gut feel. They lasted about a week.

The problem we hit: info-level issues were still dominating the score. The math is obvious in retrospect. If a 200-URL site has one critical issue and fifty info issues, the critical subtracts 10 points and the info subtracts 50. The critical issue — the active fire — accounts for 17% of the score hit. That's not right.

We tried 20/5/1. Then 15/5/1. Then 25/10/2. We ran each version against a rolling backup of customer scan data (with PII scrubbed) and looked at whether the ranked order of sites by score matched what we, as humans, would rank them.
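The comparison loop was simple. Here's a sketch of the kind of check we ran: the sites, counts, and human ranking below are invented for illustration, and the rank-correlation bookkeeping is my reconstruction rather than a dump of the real notebook.

from scipy.stats import spearmanr

# (critical, warning, info) counts per site -- invented for illustration
sites = {
    "alpha":   (0, 4, 6),
    "bravo":   (1, 2, 5),
    "charlie": (3, 4, 8),
    "delta":   (0, 0, 20),
}

# Human judgment: 1 = healthiest, 4 = worst
human_rank = {"delta": 1, "alpha": 2, "bravo": 3, "charlie": 4}

def score(counts, w_crit, w_warn, w_info):
    c, w, i = counts
    return max(0, 100 - w_crit * c - w_warn * w - w_info * i)

for weights in [(10, 3, 1), (15, 5, 1), (20, 5, 1), (25, 10, 2)]:
    scores = {name: score(counts, *weights) for name, counts in sites.items()}
    rho, _ = spearmanr([scores[n] for n in sites], [-human_rank[n] for n in sites])
    # On the real (much larger) dataset, 15/5/1 came out ahead.
    print(weights, scores, f"agreement with human ranking: {rho:.2f}")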

15/5/1 was the best fit. Here's the intuition:

  • A 15× weight on critical means a single critical issue drops your score from 100 to 85. Your "Good" score is broken. A healthy site with one critical is not a healthy site.
  • A 5× weight on warning means three warnings equals one critical. This feels right — three noticeable issues are about as bad as one active fire.
  • A 1× weight on info means you'd need fifteen info issues to equal one critical. You will have info issues; they shouldn't dominate the number.

The specific ratio 15:5:1 also normalizes to exactly 3:1:0.2 (divide through by 5), which matches SEO community consensus on issue severity (though I'm not going to pretend we derived it from first principles — we fit it to our data and then noticed the match).

The clamping question

One debate we had internally: should the score go below zero?

Arguments for: if a site has 50 critical issues (a 750-point deduction under the formula), it is dramatically worse than a site with 10 critical issues (a 150-point deduction). Both score 0 in our current formula, and that loses information.

Arguments against: scores below zero are confusing. "Your site scores -800 out of 100" is not a communicable number. Users expect 0–100 scales. Negative scores would make graph axes ugly and make marketing copy hard to write.

We clamped at zero. The argument that won was: if your score is zero, you don't need a finer-grained comparison. You need to go fix critical issues immediately. The difference between "bad" and "very bad" is not actionable in a way that justifies the UX tax.

We lose a tiny amount of information here. It hasn't mattered in practice — we have seen exactly three customer sites score zero in the first three months of the product, and in all three cases the critical issue count was between 7 and 12, not hundreds.

What we rejected

A few approaches we considered and dropped:

Normalizing by site size. The intuition is: a 10-URL site with 3 criticals is in worse shape than a 10,000-URL site with 3 criticals, because the density is higher. This is true. But it also punishes small sites for being small: when we tested it, small sites with one unresolved issue ended up scoring lower than large sites with fifteen, because one issue across ten URLs is a far higher density than fifteen issues across ten thousand. That was unintuitive. Customers care about the absolute count of fires, not the density.

Severity-weighted normalization. We tried taking the critical/warning/info count and dividing by total URLs scanned. This made the score independent of sitemap size, which was nice. But it produced tiny numbers (0.015) that required scaling, and the scaling introduced arbitrary constants that we couldn't explain. We cut it.

Time-decayed scoring. "Your score improves over time if you don't add new issues." Interesting idea — it creates a kind of reputational score. But it gets weird fast. What counts as "time"? Hours? Days since last scan? Decays how? Linear? Exponential? And customers would ask "why did my score improve without me doing anything?" which is not a question any monitoring product wants to answer with "it's complicated."

ML-based scoring. Train a model on "how many days did issues persist before customers fixed them" and weight issues by learned importance. I love this idea. I am not going to ship it. A score that users can't explain is a score users won't trust, and a black-box ML score is the worst version of that.

The alert-fatigue angle

The score drives our email alerts. We only email you when new critical or warning issues appear in a scan diff, but the severity classification of those issues comes from the same rule metadata that feeds the score. That means the score design affects the alert frequency.

Here's a trade-off we made explicitly. We could have classified some of our rules' issues as info-level. For example, a single slow response (over 2 seconds but under 5) could be info instead of warning. Doing that would make alerts less frequent — fewer warnings means fewer alert emails. But it also means the score wouldn't reflect the site's actual health the same way, and users would start ignoring the score as well as the emails.

Our rule is: if it's bad enough to affect the score meaningfully (weight × count ≥ 5 points), it's bad enough to email about. This keeps the two in sync. Users who care about their score naturally care about their alerts, and we don't have to maintain two separate mental models of severity.
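As a sketch, the gate looks like this; the dict shape and function name are hypothetical, since the real alerting reads severities from rule metadata in the scan diff:

WEIGHTS = {"critical": 15, "warning": 5, "info": 1}
ALERT_THRESHOLD = 5  # minimum score impact, in points, worth an email

def should_alert(new_issues_by_severity: dict[str, int]) -> bool:
    """Email only if the new issues in a scan diff cost at least 5 score points."""
    impact = sum(WEIGHTS[sev] * count for sev, count in new_issues_by_severity.items())
    return impact >= ALERT_THRESHOLD

print(should_alert({"warning": 1}))              # True  -- 5 points
print(should_alert({"info": 3}))                 # False -- 3 points
print(should_alert({"critical": 1, "info": 2}))  # True  -- 17 points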

Tuning against real data

Once the formula stabilized, I spent a week in a Jupyter notebook comparing scores against customer self-reported "how bad is my site right now" ratings. We sent a survey to about forty customers: "On a scale of 1–10, how many fires do you feel you're fighting on your site right now?" Then we compared that against their current health score.
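The notebook check itself was a couple of lines of numpy. Something like this, with fabricated placeholder numbers standing in for the roughly forty real responses:

import numpy as np

# Placeholder data -- the real comparison used ~40 survey responses
survey_fires = np.array([2, 7, 1, 4, 9, 3, 5, 2, 8, 6])           # "how many fires?" 1-10
health_score = np.array([88, 60, 72, 70, 40, 85, 45, 95, 55, 30])  # 0-100 score

survey_health = 10 - survey_fires  # invert so that higher = healthier
r = np.corrcoef(survey_health, health_score)[0, 1]
print(f"Pearson r = {r:.2f}")  # about 0.76 on this made-up data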

The correlation was about 0.72. Not amazing, but better than I expected. The disagreements were almost all in the same direction: customers rated their sites as healthier than the score did. This makes sense — they knew about their critical issues and had mentally excused them ("oh, that's in the backlog"). The score, correctly, had not.

We did not change the formula in response to this. The point of the score is to be correct, not to match user vibes. If the customer says they're fine and the score says they have four criticals, the score is doing its job.

What I'd change if we started over

One thing I would do differently: I would have shipped the raw issue counts prominently before shipping the score. The score is a summary, and summaries are more trustworthy when the underlying data is visible alongside them. In the current dashboard, the issue counts are there, but the score is the hero. I think that weighting is slightly off — the number draws the eye when the list of actual issues is what people should be acting on.

If I shipped v2, I'd put the issue counts as the headline and the score as a sidekick number. The score is useful for "is this trending better or worse," not for "what should I do." And "what should I do" is the actual question monitoring tools exist to answer.

The formula, again, as a closing

score = max(0, 100 − 15 × critical − 5 × warning − 1 × info)

It's three numbers and a clamp. It cost us six weeks of iteration. It will probably change again once we ship more rules. But the principle it encodes — that critical issues dominate the number, that the score degrades at a predictable rate, that clamping is better than showing negative numbers — is what I'd defend.

If you're building a health score for your own product, my strongest piece of advice is: ship the simplest possible formula. Resist the urge to normalize, ML-ify, or time-decay. Explainable beats sophisticated. Every piece of cleverness in your formula is a piece of support burden and a piece of user confusion.

Start with a weighted sum. Pick weights by intuition. Refine against real data. Ship. Iterate later.

[ END_OF_DISPATCH ]
PS
Priya Sharma
Engineering — VectraSEO

Field reports filed by operators who actually run the system. If something in this dispatch is wrong, tell us — dispatch@vectraseo.com.