Scanning a Website for Broken Links in Go
Yes, I know there are paid and free tools for doing this. And yes, I know there are tools for this that I can run locally.
But this exercise allowed me to try out the well-designed Go package github.com/gocolly/colly.
Colly is a web scraping framework for Go.
Here is how I used it to quickly scan my website (the one you are on right now) for broken links.
First, I defined a type to hold a link to check along with the URL of the page it appears on:
type link struct {
    Url     string
    PageUrl string
}
I also wrote a rudimentary function to check if a link is okay:
func checkLink(l link) bool {
    req, err := http.NewRequest("GET", l.Url, nil)
    if err != nil {
        return false
    }
    // Some servers reject requests that do not look like they come from a browser.
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    io.Copy(io.Discard, resp.Body) // Drain the body so the connection can be reused.
    return resp.StatusCode >= 200 && resp.StatusCode < 400
}
Make sure to set a timeout on the default HTTP client, so a slow or unresponsive server cannot stall the scan indefinitely:
http.DefaultClient.Timeout = 10 * time.Second
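If you would rather not mutate the shared http.DefaultClient, a dedicated client works just as well. Here is a sketch of that alternative; checkLink would then call httpClient.Do(req) instead of http.DefaultClient.Do(req):

// A dedicated client with a timeout for link checks.
var httpClient = &http.Client{Timeout: 10 * time.Second}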
Next, let's define a worker function that checks links as the crawler discovers them:
func checkLinks(links <-chan link) {
    seen := make(map[string]bool)
    for l := range links {
        if !seen[l.Url] {
            seen[l.Url] = true // Check each URL only once, even if it appears on many pages.
            if !checkLink(l) {
                // The link is broken; report it along with the page it appears on.
                fmt.Printf("Broken link %s on page %s\n", l.Url, l.PageUrl)
            }
        }
    }
}
Next, the function to crawl the website:
func crawl(domain string, links chan link) {
    c := colly.NewCollector(
        colly.AllowedDomains(domain), // Limit crawling to the site being scanned only.
        colly.Async(true),
    )
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2}) // Limit scan speed/parallelism.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        href := e.Request.AbsoluteURL(e.Attr("href"))
        if href == "" {
            return
        }
        links <- link{Url: href, PageUrl: e.Request.URL.String()}
        e.Request.Visit(href) // Follow the link to keep crawling the site.
    })
    c.Visit("https://" + domain) // Start the crawl from the homepage.
    c.Wait()                     // Wait for all asynchronous requests to finish.
}
And finally, the main function to weave it all together:
func main() {
    http.DefaultClient.Timeout = 10 * time.Second
    links := make(chan link, 100)
    done := make(chan struct{})
    go func() { checkLinks(links); close(done) }()
    crawl("hjr265.me", links)
    close(links) // No more links will be sent; let checkLinks drain the channel.
    <-done       // Wait for the remaining checks to finish before exiting.
}
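For completeness, the snippets above assume the usual package clause and imports, shown here as a sketch with the import path mentioned earlier (if you are on colly v2, the module path is github.com/gocolly/colly/v2):

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"

    "github.com/gocolly/colly"
)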
And that’s it. I can run this program to identify any broken links on my website.
Colly, an easy-to-use scraping framework, makes it possible to do more than just detect broken links. I may want to perform other routine audits later, like checking whether my images are missing alt attributes, whether my pages have the correct meta tags, and more.
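For instance, a second OnHTML callback registered on the same collector could flag images without alt text. This is just a sketch of the idea, not part of the scanner above:

// Report <img> tags with a missing or empty alt attribute.
c.OnHTML("img", func(e *colly.HTMLElement) {
    if e.Attr("alt") == "" {
        fmt.Printf("Image %s on page %s has no alt text\n",
            e.Request.AbsoluteURL(e.Attr("src")), e.Request.URL.String())
    }
})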
This post is 97th of my #100DaysToOffload challenge. Want to get involved? Find out more at 100daystooffload.com.