In my not-so-new job I work on Nexmo's developer portal, and that means a lot of documents, a lot of links, just a lot to keep track of! One thing I worry about is changing something and breaking links from somewhere else, so I wanted to be able to check for existing links and broken links, and to include internal links like http://example.com/home#something as well, since all our titles are linkable in that way.
The tool I used, muffet, was brilliant and easy. These notes are mostly for my own reference, as I had to figure a few things out as I went along.
Finding broken links
This tool can spider through your site, follow all the links, and show you any that are broken (including those internal links unless you specifically turn it off). It does cache the results so it's not hitting that cookies policy linked from your footer for every page it checks!
muffet -c 4 --exclude linkedin [url] | tee links.txt
Setting the concurrency very low seemed to help get through the link checking without issue when running on my laptop. I'm really not sure what the right settings are here but I had success with this one.
I'm excluding LinkedIn here because we link it on every page and it returns a status code 999 to spiders.
Tools are everything: I'll give a shout out to tee, which is a utility that both outputs to the terminal and writes output to a file. The file lists each page the tool visited, followed by a list of links and their status codes. Once I had the file, I could work with grep to find particular patterns of links I was interested in. Also, if there's something showing up that you don't care about, grep -v [pattern] will exclude it from your grepped results.
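As a rough sketch of that filtering, here is the same idea run against a small made-up file rather than real output. The file name sample-links.txt, the URLs, and the layout (a page line followed by tab-indented status and link lines) are all assumptions for illustration; check your own links.txt before relying on them.

```shell
# Hypothetical sample standing in for links.txt (not real muffet output).
{
  printf 'https://example.com/\n'
  printf '\t404\thttps://example.com/old-page\n'
  printf '\t999\thttps://www.linkedin.com/company/example\n'
  printf 'https://example.com/docs\n'
  printf '\t404\thttps://example.com/docs/missing\n'
} > sample-links.txt

# Keep only the lines that mention a pattern I'm interested in.
grep "docs" sample-links.txt

# Drop the lines I don't care about (here, anything mentioning linkedin).
grep -v "linkedin" sample-links.txt

# grep -c counts matching lines instead of printing them.
docs_count=$(grep -c "docs" sample-links.txt)
kept_count=$(grep -vc "linkedin" sample-links.txt)
echo "docs lines: $docs_count, lines left after excluding linkedin: $kept_count"
```

The two filters can also be chained with a pipe, for example grep "docs" followed by grep -v on whatever noise is left.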
I also loved using wc -l links.txt to get an immediate sense of how many errors we have (it's not an accurate count, because the file includes a line for each page as well as the failed links, but it gives you a sense of scale).
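If you want a closer count, one possible approach is to count only the link lines and skip the page lines. This is a sketch that assumes the link lines are indented with whitespace underneath each page line, which you should confirm against your own file. The sample file name and URLs are invented.

```shell
# Hypothetical sample standing in for links.txt (not real muffet output).
{
  printf 'https://example.com/\n'
  printf '\t404\thttps://example.com/old-page\n'
  printf 'https://example.com/docs\n'
  printf '\t404\thttps://example.com/docs/missing\n'
  printf '\t500\thttps://example.com/docs/broken\n'
} > sample-count.txt

# Every line, page lines included: the quick sense-of-scale number.
total_lines=$(wc -l < sample-count.txt)

# Only lines that start with whitespace, assumed here to be the failed links.
failed_links=$(grep -c '^[[:space:]]' sample-count.txt)

echo "total lines: $total_lines"
echo "failed links: $failed_links"
```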
Identifying links to one site from another
Like most organisations, we have more than one website and it can be easy to miss when a change in one would cause a broken link on the other. For this I used muffet's -v switch to show me ALL links, not just the broken ones.
muffet -v -c 4 --exclude linkedin [url] | tee all-links.txt
This shows all the links and enables me to build a map of the links from one site. Then I take the file and look at just the ones I'm interested in (the ones on that developer portal I mentioned) with a command like this:
grep "developer.nexmo.com" all-links.txt | sort | uniq
And now I can see everything that links in to the site that I should be mindful of (or that we already broke, oops! Luckily there weren't many of those).
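To show what that pipeline does, here is a sketch against an invented file. The file name sample-all-links.txt, the URLs and the layout are assumptions, not real output. Two small variations are included: grep -F, which treats the pattern as a fixed string so the dots only match literal dots, and uniq -c, which shows how many times each distinct line turned up.

```shell
# Hypothetical sample standing in for all-links.txt (not real muffet output).
{
  printf 'https://www.example.com/\n'
  printf '\t200\thttps://developer.nexmo.com/api\n'
  printf '\t200\thttps://www.example.com/about\n'
  printf 'https://www.example.com/about\n'
  printf '\t200\thttps://developer.nexmo.com/api\n'
  printf '\t404\thttps://developer.nexmo.com/old-guide\n'
} > sample-all-links.txt

# uniq only collapses adjacent duplicates, so sort first.
unique_links=$(grep -F "developer.nexmo.com" sample-all-links.txt | sort | uniq)

# Same pipeline, but prefix each distinct line with its number of occurrences.
link_counts=$(grep -F "developer.nexmo.com" sample-all-links.txt | sort | uniq -c)

echo "$unique_links"
echo "$link_counts"
```

The counts give a rough idea of which inbound links appear on the most pages, which is handy when deciding what to be most careful about.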
Hopefully, if you have a similar requirement, this tool could help you too. I'm not sure I'd run it as a build step, as it takes a long time, but I'm considering scheduling it to do a regular check on sites. I'd be interested to hear how others are using this tool too.