A blog about digital marketing & website optimization

More assorted improvements

Today we released a few more assorted improvements to SiteCondor. These are likely to be the last small enhancements out the door for a while, as we are now turning our attention to a couple of major new features (stay tuned!).

Explore UI enhancements

We have improved the Explore screens with better-aligned tables and values, making more efficient use of screen real estate.

Incremental 404 reporting

Following our previous improvement to 404 error reporting, where we added the list of resources each 404 originated from, we now add URLs to this “found at” list on the fly as they are discovered (as opposed to post-processing them after job completion).
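Conceptually, the incremental approach just appends to each 404’s referrer list at discovery time instead of computing it in a pass after the job finishes. A minimal sketch (the names here are our own illustration, not SiteCondor’s actual code):

```python
from collections import defaultdict

# Map each broken URL to the set of pages it was found at.
found_at = defaultdict(set)

def record_404(broken_url, source_url):
    """Called the moment the crawler sees a link that returns a 404.

    The "found at" list grows on the fly, so the report is always
    current -- no post-processing step after job completion.
    """
    found_at[broken_url].add(source_url)

record_404("https://example.com/missing", "https://example.com/")
record_404("https://example.com/missing", "https://example.com/about")
```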

Improved resource discovery

When we first launched SiteCondor, our crawler discovered links using a JavaScript-based method that closely resembled a web browser environment. While this method had the advantage of being very accurate (the discovered resources were always correct), it was a little slow, used more computing resources than we liked, and sometimes missed some resources.

So we moved to a much faster, regular-expression-based procedure. This method was less accurate (some discovered resources were not valid), but it was much faster, used fewer computing resources, and discovered more resources. In the long run, however, its drawback was that on a few sites the invalid resources represented a high enough percentage to generate lots of unnatural, incorrect 404s from requests to malformed URLs.
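To illustrate why a purely regex-based extractor over-matches: a naive pattern like the hypothetical one below (our example, not the pattern we actually used) happily pulls href values out of commented-out markup and script templates that a real browser would never follow, and each of those becomes a bogus request:

```python
import re

html = """
<a href="/products">Products</a>
<!-- <a href="/old-page">retired</a> -->
<script>var tmpl = '<a href="/{slug}">item</a>';</script>
"""

# Fast, but blind to context: it cannot tell live markup from
# comments or JavaScript string templates.
links = re.findall(r'href="([^"]+)"', html)
# Picks up /old-page and /{slug} alongside /products,
# which is exactly what produced the incorrect 404s.
```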

After spending more time on the issue, we finally arrived at a method that has it all: it is fast, uses little computing power, and is very accurate (it is still JavaScript-based and closely resembles a web browser environment). The result: almost zero incorrectly discovered URLs, an extremely clean 404 report, faster crawling, and almost no missed resources.

Welcome to the SiteCondor Blog

SiteCondor is a website analysis tool for digital marketing experts. You should check it out.

Welcome to our blog, where we share ideas about website optimization, web dev, product updates, and more.


New features and improvements potpourri

Hello again! Lately we’ve been busy reviewing SiteCondor’s feature set and the very valuable feedback from our beloved users (thank you!). As a result, we’ve decided to improve some key features and add some frequently requested ones. Below you’ll find a summary of the major updates. Feel free to try out SiteCondor and experience them first-hand; we’d love to know what you think.

Improved Search

If you’ve used SiteCondor in the past, you are likely familiar with the Explore menu and the different sections underneath it (Resources, Titles, Images, Meta Descriptions, Headings, Internal Links, External Links, URLs, Structured Data, Others, and XML Sitemap). Within these sections, tabs present different aggregate views of those elements, letting you slice and dice the data without exporting to CSV and opening a spreadsheet application (though that’s also available).

This update replaces the old, simple search within each tab with a much more powerful one: you can continue to use the previous “contains”-style searches, and you can now also run regular-expression searches. We also included an easier way to back out of search results, along with a message clearly showing the search that produced the current results. Here’s a quick example showing how to search for all Titles containing either “Austin” or “Work”:
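The difference between the two search modes can be sketched in a few lines (our own illustration with made-up titles, not SiteCondor code):

```python
import re

titles = [
    "Work With Us",
    "Austin Office Hours",
    "Pricing",
    "Our Work",
]

# "Contains"-style search: plain substring match, one term at a time.
contains = [t for t in titles if "Austin" in t]

# Regular-expression search: one pass matches either term.
pattern = re.compile(r"Austin|Work")
matched = [t for t in titles if pattern.search(t)]
```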


Improved 404 error reporting

Our previous 404 section did not explicitly or conveniently display where the errors originated (i.e., where the broken links, broken images, etc. were found). You could work around this by running searches on the other sections, but that wasn’t a great user experience. So we added a “found at” expandable section that shows the URLs where those broken resources were found, right on the same screen (and in the exported CSV files as well):


New timeout error reporting

Most of the time, errors from web servers come back to our speedy crawler in nicely packaged, standards-compliant ways (i.e., with appropriate HTTP status codes). But let’s face it, s^&#*t happens. Network connections go down, somebody kicks a cable, web servers melt down, and zombies may attack at any time. When errors don’t come back to us in a timely manner, our crawler eventually gives up on that particular resource. Previously we quietly ignored these situations. We realized this was a problem, so we now report these errors within the Resources/Other Errors tab with a Status Code of 599. (599 is not part of any HTTP RFC; it is a status code commonly used to indicate network, client, or proxy timeouts.) If you see lots of 599 or 403 errors in your crawl results, try running your job again with less aggressive settings (generally speaking, dial down Concurrency, increase Throttling, and maybe increase Timeout as well).
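The reporting rule itself is simple: a resource that times out gets the conventional 599 code instead of being dropped. A minimal sketch of that classification (our illustration, assuming a crawler that tracks whether each fetch timed out):

```python
# 599 is not defined by any HTTP RFC; it is a conventional code
# for network, client, or proxy timeouts.
TIMEOUT_STATUS = 599

def classify_response(status=None, timed_out=False):
    """Return the status code to report for a fetched resource.

    Timed-out fetches used to be silently ignored; now they are
    surfaced under Resources / Other Errors with a 599.
    """
    if timed_out:
        return TIMEOUT_STATUS
    return status
```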

New Job options: Max Resources, Disregard URL Query Strings, improved URL filter 

Upon request, our Job settings got a facelift too. We’ve moved up the protocol option and added a Max Resources option, which limits the number of resources used for the job, causing the crawler to stop early if the site is larger than the specified limit. We’ve also added an improved URL filter option supporting both “contains” and “regular expression” filtering, and a new Disregard Query Strings option. When enabled, Disregard Query Strings removes the query string portion of each URL before requesting the resource. This is useful when crawling sites that make extensive use of query strings without necessarily returning different or interesting content.
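The query-string stripping itself amounts to a small URL rewrite, which Python’s standard library expresses neatly (a sketch of the idea, not SiteCondor’s implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def disregard_query_string(url):
    """Drop the query string before requesting the resource.

    The fragment is dropped too, since it is never sent to the
    server by a crawler anyway.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

disregard_query_string("https://example.com/shop?sessionid=123&sort=asc")
# -> "https://example.com/shop"
```

With this enabled, `/shop?sessionid=123` and `/shop?sessionid=456` collapse into a single crawled resource instead of two.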


Improved Job Summary display

Our Job Summary page now includes the new job options and presents the information in a clearer, easier-to-digest fashion (time units where appropriate, Yes/No instead of true/false, etc.).

Bug Fixes and improved crawler

As more users create jobs to crawl different domains around the web, we keep finding situations where our crawler could better process certain sites (particularly sites with very poor markup, servers that aren’t well behaved, or, generally speaking, sites that aren’t crawl-friendly). While making these updates, we also took care to improve our crawler, making it both more accurate and faster and giving it the ability to handle edge cases more gracefully.

We hope you’ll enjoy this new release. As usual, please stay in touch and let us know what you think. Best,


Adding Open Graph support

Following our recent introduction of structured data features, we have just added support for Open Graph. This means any new crawl jobs you create with SiteCondor will now look for and extract Open Graph metadata, enabling you to later explore, search, and export the collected information to CSV.

Here’s a quick screenshot showing what this looks like for a couple of TechCrunch pages:


As seen in the screenshot, the feature also extracts Facebook-related meta information, such as the Facebook application identifier.

We hope you enjoy this new feature! If there are other features you would like to see implemented, don’t hesitate to drop us a line.

Introducing Structured Data Features

We’re pleased to announce the following structured data features: Microdata, Authorship, and Twitter Cards.

As you probably know, Google and Bing are actively encouraging webmasters to use structured data so their crawlers can better understand the semantics of your site. Not only that: Twitter and Facebook are pushing the use of Twitter Cards and Open Graph tags.

SiteCondor now has a Structured Data section which includes support for Microdata, Authorship, and Twitter Cards.


SiteCondor Microdata Screenshot

An example of extracted microdata from the Food Network.


SiteCondor Authorship Screenshot

Within the Authorship tab, you’ll find author and publisher information as discovered by SiteCondor. The feature supports <a> and <link> elements with rel=author and rel=publisher (both as an attribute and as part of the URL).
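Detecting both forms (the rel attribute and a rel= query parameter in the URL) can be sketched with the standard library’s HTML parser; this is our own illustration of the idea, not SiteCondor’s code:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

class AuthorshipParser(HTMLParser):
    """Collect hrefs marked as author/publisher, whether the marker
    is a rel attribute or a rel=... parameter inside the URL."""

    def __init__(self):
        super().__init__()
        self.authorship = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href", "")
        rel = attrs.get("rel", "")
        in_url = parse_qs(urlsplit(href).query).get("rel", [])
        if rel in ("author", "publisher") or \
                any(v in ("author", "publisher") for v in in_url):
            self.authorship.append((rel or in_url[0], href))

parser = AuthorshipParser()
parser.feed('<link rel="author" href="https://plus.google.com/123">'
            '<a href="https://plus.google.com/456?rel=publisher">Us</a>')
```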

Twitter Cards

SiteCondor Twitter Card Screenshot

Ever wonder why certain tweets have richly formatted content? Those are Twitter Cards, and SiteCondor now recognizes them, as shown in the screenshot above.

But wait, there’s more!

This update also includes numerous performance enhancements, bug fixes, and beefed-up servers to power it all. This means you can crawl bigger sites than ever before, and faster.

Give the latest version of SiteCondor a try and let us know what you think.

Until next post!

– Seb & Judd