Public Exposure · Updated 2026-05-02

Robots.txt Analysis

Listing sensitive paths in robots.txt advertises them to attackers. Use real auth or noindex; reserve robots.txt for crawl management.

robots.txt is a search-engine instruction, not a security control. Anything listed in a Disallow: directive is announced to every visitor, and attackers routinely read robots.txt as a first reconnaissance step. Disallowing /admin/ tells search engines not to index it AND tells attackers exactly where to point their authentication brute-forcer. This check fetches /robots.txt and deducts points when sensitive paths appear in Disallow lines.

The check also looks for a /.well-known/security.txt file as a positive informational signal (RFC 9116, vulnerability reporting contact). It does not affect the score; it shows up as security_txt_note in the evidence.

How the check works

On each scan, RedScore fetches https://yourdomain.tld/robots.txt (falling back to HTTP if HTTPS is unavailable) and parses the Disallow: lines. Each path is checked against a list of sensitive tokens: admin, internal, staging, debug, test, backup, private, secret, config, dashboard, panel, manage, phpinfo, server-status, server-info, cgi-bin, plus any path matching /api/v<digits>. Each sensitive hit deducts 2 points, capped at 12 of the 15 points available to this check.

  • Tier absent_or_no_disallow (1.0 score): no robots.txt published, or robots.txt has no Disallow lines.
  • Tier standard (1.0 score): robots.txt has Disallow lines, but none match sensitive tokens.
  • Tier sensitive (variable score): one or more Disallow lines name sensitive tokens. Reason: robots_sensitive_paths. Deduction: 2 per hit, max 12 (up to 2 hits still pass, 3-5 hits warn, 6+ hits fail).

Verdict thresholds: pass at 0.9 and above (2 or fewer sensitive hits), warn at 0.45 and above (3-5 hits), fail below 0.45 (6 or more hits).
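
To approximate the check from a terminal before re-scanning, you can count Disallow lines and sensitive-token matches yourself. This is a rough sketch rather than the scanner's exact parser: the token list mirrors the one above, and the domain is a placeholder.

Approximate the check with curl and grep

curl -s https://yourdomain.tld/robots.txt | grep -ci '^Disallow:'
# total Disallow lines (disallow_directive_count)

curl -s https://yourdomain.tld/robots.txt | grep -i '^Disallow:' \
  | grep -ciE 'admin|internal|staging|debug|test|backup|private|secret|config|dashboard|panel|manage|phpinfo|server-status|server-info|cgi-bin|/api/v[0-9]+'
# sensitive hits (sensitive_pattern_hits); each costs roughly 2 points, capped at 12 of 15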

How the verdict maps to evidence

  • Pass: no Disallow at all, only standard non-sensitive Disallows, or up to 2 sensitive hits.
  • Warn: 3-5 sensitive Disallows.
  • Fail: 6 or more sensitive Disallows. The robots.txt is effectively a roadmap.

Evidence shows robots_http_status, disallow_directive_count, sensitive_pattern_hits, the scoring tier, and (separately) any security.txt note.

Special states

  • Degraded: probe data unavailable. Fix Web Assessability first.

Fix: stop using robots.txt for security

robots.txt is purely advisory. Well-behaved bots respect it; attackers, scrapers, and anyone with a browser do not. Sensitive endpoints need real protections instead.

1. Remove sensitive paths from Disallow lines

If your robots.txt currently looks like the bad version below, edit it. Keep robots.txt for actual crawl management (excluding duplicate-content paths, draft pages, large infinite-scroll archives) and remove any path that names internal tooling.

BAD: announces sensitive endpoints

User-agent: *
Disallow: /admin/
Disallow: /internal/api/
Disallow: /staging/
Disallow: /debug/
Disallow: /backup/
Disallow: /config/
Disallow: /dashboard/
Disallow: /api/v1/admin/

GOOD: minimal robots.txt for crawl management

User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /print/
Sitemap: https://yourdomain.tld/sitemap.xml

2. Use real protections for sensitive endpoints

  • Authentication: require login for /admin, /dashboard, and internal tooling. Whether a bot indexes the URL becomes irrelevant once visiting it requires a session.
  • IP allow-lists at the edge: lock /admin to office or VPN ranges via Cloudflare WAF rules, AWS WAF, your reverse-proxy ACL, or origin firewall.
  • noindex meta tag: <meta name="robots" content="noindex, nofollow"> in the page's <head>. Tells search engines not to index without revealing the URL in robots.txt.
  • X-Robots-Tag header: same effect as the meta tag, applied via response header. Useful for non-HTML responses (PDFs, images): X-Robots-Tag: noindex, nofollow.

X-Robots-Tag header in nginx

location /admin/ {
    add_header X-Robots-Tag "noindex, nofollow" always;
    auth_basic "Admin";
    auth_basic_user_file /etc/nginx/.htpasswd;
    # plus IP allow-list:
    allow 198.51.100.0/24;     # office VPN
    deny all;
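    # with nginx's default "satisfy all", a request must pass BOTH the IP allow-list and basic auth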
    proxy_pass http://backend;
}
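
After reloading nginx, confirm the header actually reaches clients. The always parameter above keeps the header on 401 and 403 responses, so you can verify it without authenticating (the /admin/ path is illustrative):

Confirm the header is served

curl -sI https://yourdomain.tld/admin/ | grep -i '^x-robots-tag'
# expect: X-Robots-Tag: noindex, nofollow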

3. Publish security.txt for the bonus signal

Publish /.well-known/security.txt per RFC 9116 with a vulnerability reporting contact. The check does not score it, but the evidence note flags its presence as good practice.

/.well-known/security.txt

Contact: mailto:security@yourdomain.tld
Contact: https://yourdomain.tld/security
Expires: 2027-01-01T00:00:00.000Z
Encryption: https://yourdomain.tld/.well-known/security.asc
Acknowledgments: https://yourdomain.tld/security/hall-of-fame
Preferred-Languages: en
Canonical: https://yourdomain.tld/.well-known/security.txt
Policy: https://yourdomain.tld/security/policy

Expires must be in the future; renew the file before that date. Many bug bounty platforms (HackerOne, Bugcrowd, Intigriti) accept the security.txt format directly.
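
To spot a stale file at a glance, print the Expires line whenever you audit the domain (a quick sketch; it assumes the file is served at the standard well-known path):

Check the Expires line

curl -s https://yourdomain.tld/.well-known/security.txt | grep -i '^Expires:'
# renew the file before this date passes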

Verify the fix

  • curl -s https://yourdomain.tld/robots.txt and read every Disallow line. Confirm no path names admin, internal, staging, debug, backup, private, config, dashboard, panel, or any of the other sensitive tokens listed above, and that none matches /api/v<n>.
  • curl -sI https://yourdomain.tld/admin/ should return 401, 403, or a redirect to login (not 200). The endpoint is protected regardless of whether robots.txt mentions it.
  • curl -s https://yourdomain.tld/.well-known/security.txt should return your contact info if you published one.
  • Re-run the RedScore lookup. Pass requires 2 or fewer sensitive Disallow hits.

Common pitfalls

  • Removing sensitive Disallows but leaving the endpoints public. The check stops flagging robots.txt, but the endpoints are still reachable. Pair removal with real auth.
  • Using robots.txt as documentation. Some teams use Disallow: lines to remind themselves which paths exist. Use a real internal wiki; do not document your attack surface in a public file.
  • Allow: directives that re-expose what Disallow hides. robots.txt parsers resolve Allow and Disallow by the most specific (longest) matching rule, so an Allow: /admin/public/ line reveals the /admin/ tree just as clearly as Disallow: /admin/private/ does.
  • Wildcard Disallow paths. Disallow: /*admin* matches any URL containing the word admin. The pattern itself reveals you have admin endpoints somewhere on the site, even if no specific path is named.
  • robots.txt linking to your sitemap that lists everything anyway. If your sitemap.xml exposes URLs you wanted hidden, removing them from robots.txt does not help. Audit the sitemap too; a quick command follows this list.
  • Treating noindex as a security control. noindex prevents indexing in compliant search engines; it does NOT prevent fetching. Anyone visiting the URL still gets the content. Pair with auth or IP allow-list.
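
For the sitemap pitfall above, here is a quick audit for the same sensitive tokens (a sketch; it assumes a single sitemap.xml at the site root rather than a sitemap index file):

Sitemap audit

curl -s https://yourdomain.tld/sitemap.xml | grep -oiE '<loc>[^<]*(admin|internal|staging|debug|backup|private|secret|config|dashboard|panel)[^<]*</loc>'
# any output is a URL you are still publishing despite wanting it hidden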

What to do next

See how these recommendations apply to your site's current scan results.

Scan domain