Hello again! Lately we’ve been busy reviewing SiteCondor’s feature set and the very valuable feedback from our beloved users (thank you!). As a result, we’ve decided to improve some key features and add some frequently requested ones. Below you’ll find a summary of the major updates. Feel free to try out SiteCondor and experience them first-hand; we’d love to know what you think.
If you’ve used SiteCondor in the past, you are likely familiar with the Explore menu and the different sections underneath it (Resources, Titles, Images, Meta Descriptions, Headings, Internal Links, External Links, URLs, Structured Data, Others, and XML Sitemap). Within these sections, tabs present different aggregate views of those elements, letting you slice and dice the data without exporting to CSV and opening a spreadsheet application (though that’s also available).
This update replaces the old, simple search within each of the tabs with a much more powerful one: you can continue to use the previous “contains”-like searches, and you can now also run Regular Expression searches. We also included an easier way to back out of search results, along with a message clearly showing the search that produced the current results. Here’s a quick example showing how to search for all Titles containing either “Austin” or “Work”:
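SiteCondor runs this search for you in the app, but if you’re new to regular expressions, here’s a minimal Python sketch of what the “Austin or Work” example does conceptually (the title strings are made up for illustration):

```python
import re

# Hypothetical page titles, as they might appear in the Titles section.
titles = [
    "Work With Us | Example Co.",
    "Austin Office Hours",
    "About Our Team",
]

# The regex from the example: the | operator matches titles containing
# either "Austin" or "Work".
pattern = re.compile(r"Austin|Work")

matches = [t for t in titles if pattern.search(t)]
print(matches)  # the first two titles match; "About Our Team" does not
```

A plain “contains” search is the special case where the pattern is just a literal string with no regex operators.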
Improved 404 error reporting
Our previous 404 section did not conveniently display where the errors originated (i.e., where the broken links, broken images, etc. were found). You could get around this by running searches on the other sections, but that wasn’t a great user experience. So we added a “found at” expandable section that lets you see the URLs where those broken resources were found, right on the same screen (and in the exported CSV files as well):
New timeout error reporting
Most of the time, errors from web servers come back to our speedy crawler in nicely packaged, standards-compliant ways (i.e., with appropriate HTTP status codes, etc.). But let’s face it, s^&#*t happens. Network connections go down, somebody kicks a cable, web servers melt down, and zombies may attack at any time. When errors don’t come back to us in a timely manner, our crawler eventually gives up on that particular resource. Previously we were quietly ignoring these situations. We realized this was a problem, so we now report these errors within the Resources/Other Errors tab with a Status Code of 599. (This is not part of the HTTP standard, but rather a status code conventionally used to indicate network, client, or proxy timeouts.) If you see lots of 599 or 403 errors in your crawl results, try running your job again with less aggressive settings (generally speaking, dial down Concurrency, increase Throttling, and maybe increase Timeout as well).
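To make the convention concrete, here’s a minimal Python sketch (not SiteCondor’s actual crawler code) of recording a synthetic 599 when a resource doesn’t respond in time, instead of silently dropping it. The function name and timeout value are illustrative:

```python
import socket
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or a synthetic 599 on timeout/failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # the server answered, just with an error status
    except (urllib.error.URLError, socket.timeout):
        return 599     # no timely or usable response: report a timeout
```

The key idea is that a timeout still produces a row in the report, with a status code that can’t be confused with a real server response.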
New Job options: Max Resources, Disregard URL Query Strings, improved URL filter
Upon request, our Job settings got a facelift too. We’ve moved up the protocol option and added a Max Resources option, which limits the number of resources to be used for the job, causing the crawler to stop early if the site is larger than the specified limit. We’ve also added an improved URL filter option supporting both “contains” and “regular expression” filtering, and a new Disregard Query Strings option. When enabled, the Disregard Query Strings setting removes the query string part of the URL before requesting the resource. This is useful when crawling sites that make extensive use of query strings without necessarily returning different or interesting content.
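Conceptually, stripping the query string is a small URL transformation. Here’s a Python sketch of what the option does to a URL before fetching (the example URL is made up; SiteCondor applies this internally when the option is enabled):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string (and fragment) from a URL, keeping the rest."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("https://example.com/products?sort=price&page=2"))
# → https://example.com/products
```

So `?sort=price&page=2` and `?sort=name` both collapse to the same resource, and the crawler fetches it once instead of once per query-string variant.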
Improved Job Summary display
Our Job Summary page now includes the new job options, and presents the information in a clearer, easier-to-read format (time units are used where appropriate, Yes/No instead of true/false, etc.).
Bug Fixes and improved crawler
As more users create jobs to crawl different domains around the web, we continue to find situations where our crawler could better handle certain sites (particularly sites with very poor markup, servers that aren’t well behaved, or sites that generally aren’t crawl-friendly). While making these updates, we also took the opportunity to improve the crawler itself, making it both more accurate and faster, and better able to handle edge cases.
We hope you’ll enjoy this new release. As usual, please stay in touch and let us know what you think. Best,