Detecting hosting providers / by Murray Hurps

Last month we completed the surveying work for Startup Muster 2013, the largest survey ever of the Australian startup scene.

In two weeks 430 Australian startup founders completed the 55-question survey, and we’ve been buried in cleanup, validation and analysis ever since.

One problem we have is trying to figure out the total number of startups in Australia.

The best idea I’ve had so far is to:

  • Detect the hosting company used by each startup.
  • Approach one of the large hosting companies for the total number of startups using their hosting in Australia.
  • Divide by the market share detected to estimate the total number of startups.
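
The arithmetic behind the last step can be sketched as follows. This is only an illustration: the function name and every number in it are made up, and it assumes the surveyed startups are a representative sample of the provider's customers.

```javascript
// Hypothetical helper: estimate the total number of startups from one
// provider's customer count and that provider's share of the survey sample.
function estimateTotalStartups(providerCustomerCount, sampleUsingProvider, sampleSize) {
  if (sampleSize <= 0 || sampleUsingProvider <= 0) {
    throw new RangeError('sample counts must be positive');
  }
  const marketShare = sampleUsingProvider / sampleSize;
  return Math.round(providerCustomerCount / marketShare);
}

// Made-up numbers: 43 of 430 surveyed startups use the provider (a 10% share),
// and the provider reports 500 Australian startup customers.
console.log(estimateTotalStartups(500, 43, 430));
```

The smaller the detected share, the more a few misclassified startups move the estimate, so a provider with a large share is the better one to approach.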

To do this, I first had to detect the hosting providers.

All screenshots below use example data for privacy reasons.

Gathering domain data

Using OpenRefine, create a new project and import the domains you want to analyze.

image

Run the following Node.js script to provide some local endpoints for DNS/rDNS/TLD lookups:

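var dns = require ('dns'),
    app = require ('express') ();

// GET /dns/<ip> responds with the reverse DNS names for an IPv4 address.
// GET /dns/<domain> responds with the IPv4 address the domain resolves to.
app.get ('/dns/:target', function (req, res)
{
  var t = req.params.target;

  if (t.match (/^[\d.]+$/))
  {
    // IP: reverse lookup, empty response if it fails
    return dns.reverse (t, function (error, domains)
    {
      res.end ((domains || []).join (','));
    });
  }

  // Domain: forward lookup (IPv4 only), empty response if it fails
  dns.lookup (t, 4, function (error, address)
  {
    res.end (address || '');
  });
});

// GET /tld/<hostname> reduces a hostname using the tld package
app.get ('/tld/:target', function (req, res)
{
  res.end (require ('tld').registered (req.params.target || ''));
});

// OpenRefine will fetch from http://localhost:5000
require ('http').createServer (app).listen (5000);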

Click the Domain column -> Edit column -> Add column by fetching URLs.

image

Click on the new IP column -> Edit column -> Add column by fetching URLs.

image

We now have the IP and reverse DNS lookup result. The latter is more useful when reduced to just its registered domain (which is what the script’s /tld endpoint is for), so click on the ReversedDomain column -> Edit column -> Add column by fetching URLs.

image
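
To make the reduction step concrete, here is a minimal sketch of turning a reverse DNS hostname into a registered domain. It is a simplified stand-in, not the tld package used in the script above: the function name is hypothetical and the suffix list is a tiny illustrative sample rather than the real Public Suffix List.

```javascript
// Tiny illustrative sample of multi-part suffixes (an assumption, not the
// full Public Suffix List).
const MULTI_PART_SUFFIXES = ['com.au', 'net.au', 'org.au', 'co.uk', 'co.nz'];

// Hypothetical helper: keep the last two labels of a hostname, or the last
// three when the final two form a known multi-part suffix.
function registeredDomain(hostname) {
  const host = String(hostname || '').toLowerCase().replace(/\.$/, '');
  const labels = host.split('.').filter(Boolean);
  if (labels.length < 2) return '';
  const lastTwo = labels.slice(-2).join('.');
  const keep = MULTI_PART_SUFFIXES.includes(lastTwo) ? 3 : 2;
  if (labels.length < keep) return '';
  return labels.slice(-keep).join('.');
}

console.log(registeredDomain('ec2-54-252-0-1.ap-southeast-2.compute.amazonaws.com'));
console.log(registeredDomain('server1.example.com.au'));
```

Grouping by this shorter value is what makes the later facet readable, since thousands of distinct hostnames collapse into a handful of domains.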

Retrieving registration data

We’ll now retrieve IP assignment data from ARIN. Click on the IP column -> Edit column -> Add column by fetching URLs.

image

Note that this time we have a throttle delay value to avoid hitting their endpoint too quickly.

We now have the registration data in JSON format. To extract the organization, click the DataFromARIN column -> Edit column -> Add column based on this column.

image
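
Outside OpenRefine, the same extraction might look like the sketch below. The field names (net.orgRef and net.customerRef, each with an "@name" key) are an assumption about the shape of ARIN's JSON and should be checked against a real response; the sample record is invented.

```javascript
// Hypothetical helper: pull the organization name out of an ARIN response.
// ASSUMPTION: the name lives at net.orgRef["@name"], or at
// net.customerRef["@name"] for customer assignments.
function organizationFromArin(jsonText) {
  let data;
  try {
    data = JSON.parse(jsonText);
  } catch (e) {
    return ''; // not JSON (for example an error page)
  }
  const net = (data && data.net) || {};
  const ref = net.orgRef || net.customerRef || {};
  return ref['@name'] || '';
}

// Invented sample record, for illustration only.
const arinSample = JSON.stringify({
  net: { orgRef: { '@name': 'Example Hosting Pty Ltd', '@handle': 'EXAMPLE-1' } }
});
console.log(organizationFromArin(arinSample));
```

Returning an empty string on any failure keeps the column easy to facet, because blank cells group together.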

Handling regional registries

Some IPs are delegated to regional registries other than ARIN, so to handle this click the OrganizationFromARIN column -> Facet -> Text facet.

Hover over the Asia Pacific Network Information Centre entry (APNIC), then click Include.

image

You’ll notice that now only these rows are shown. Any actions you take at this point will only affect the rows displayed.

Click the IP column again -> Edit column -> Add column by fetching URLs.

image

Sadly the HTML data from APNIC isn’t as easy to digest as the JSON data from ARIN.

Click the DataFromAPNIC column -> Edit column -> Add column based on this column, and brace yourself for a regular expression.

image
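
The expression in the screenshot isn't reproduced here, but the idea can be sketched in JavaScript. This version assumes the APNIC page embeds whois-style "descr:" lines and takes the first one as the organization; both the function name and the sample page are hypothetical.

```javascript
// Hypothetical helper: extract the first whois-style "descr:" value.
// ASSUMPTION: the HTML wraps plain "key: value" lines, possibly with tags.
function organizationFromApnic(html) {
  // Strip tags first so markup between the label and the value is ignored.
  const text = String(html || '').replace(/<[^>]*>/g, '');
  const match = text.match(/^\s*descr:[ \t]*(.+?)[ \t]*$/m);
  return match ? match[1] : '';
}

// Invented sample page, for illustration only.
const apnicSample = [
  '<pre>inetnum:        203.0.113.0 - 203.0.113.255',
  'netname:        EXAMPLE-NET',
  '<b>descr:</b>          Example Hosting Pty Ltd',
  'descr:          Sydney',
  'country:        AU</pre>'
].join('\n');
console.log(organizationFromApnic(apnicSample));
```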

Depending on the kind of targets you have, you may need to handle other registries, such as RIPE for European IPs, but for this example we don’t need to.

Click the X on the OrganizationFromARIN facet to remove it. All rows should now be shown.

We’ll now combine the two Organization columns into a single one. Click the OrganizationFromARIN column -> Edit column -> Add column based on this column.

image
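
The merge rule is simple enough to sketch in a few lines. This is one possible rule, not necessarily the expression in the screenshot: use the APNIC value whenever ARIN only names the registry, and otherwise keep the ARIN value. The function name is hypothetical.

```javascript
// The registry name as it appears in the ARIN facet.
const APNIC_NAME = 'Asia Pacific Network Information Centre';

// Hypothetical helper: merge the two organization columns into one value.
function combineOrganization(fromArin, fromApnic) {
  const arin = (fromArin || '').trim();
  const apnic = (fromApnic || '').trim();
  // ARIN naming the registry means the real holder is in the APNIC data.
  if (arin === APNIC_NAME && apnic) return apnic;
  return arin || apnic;
}

console.log(combineOrganization(APNIC_NAME, 'Example Hosting Pty Ltd'));
console.log(combineOrganization('Amazon.com, Inc', ''));
```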

Detecting shared data centers

The reason we did the reverse domain lookup is to detect hosting companies that use other companies’ data centers.

First clean up the displayed columns by clicking the All column -> View -> Collapse all columns, then hover over the column titles, clicking the Domain, Organization and ReversedTLD columns to show them again.

Click the Organization column -> Facet -> Text facet, then click Count to order by popularity.

Then do the same for the ReversedTLD column.

image

Hover over the “Amazon.com, Inc” entry in the top facet, then click Include.

You’ll see the bottom facet is then updated to show only the values associated with Amazon:

image

This happens because Heroku uses Amazon data centers.

In the main data display, hover over a value in the displayed Organization column -> Click “edit” -> Enter “Heroku” -> Click “Apply to all identical cells”. As your display is currently faceted, this will only apply to the Heroku ones.

image

Repeat these steps for the most common hosting companies to check they all have the expected reverse domain lookups.
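
If you end up repeating this for many providers, the relabelling can be expressed as a lookup table. The sketch below is an illustration under assumptions: the row field names and the table entry are hypothetical, and the real reverse-lookup values are whatever your own facet shows.

```javascript
// Hypothetical table: reverse-lookup domain -> hosting company that runs
// on someone else's infrastructure. The entry is illustrative only.
const HOST_BY_REVERSED_DOMAIN = { 'heroku.com': 'Heroku' };

// Return new rows where the organization is replaced when the reverse
// lookup points at a known reseller; other rows are passed through as-is.
function relabelRows(rows, overrides) {
  return rows.map(function (row) {
    const known = Object.prototype.hasOwnProperty.call(overrides, row.reversedTLD);
    return known
      ? Object.assign({}, row, { organization: overrides[row.reversedTLD] })
      : row;
  });
}

// Invented rows, for illustration only.
const exampleRows = [
  { domain: 'a.example', organization: 'Amazon.com, Inc', reversedTLD: 'heroku.com' },
  { domain: 'b.example', organization: 'Amazon.com, Inc', reversedTLD: 'amazonaws.com' }
];
console.log(relabelRows(exampleRows, HOST_BY_REVERSED_DOMAIN));
```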

Done!

All going well, your Organization facet should end up with some useful results.

image
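
A text facet ordered by count is essentially a tally, and the share of each value is the market share needed for the estimate. A minimal sketch, with a hypothetical function name and invented values:

```javascript
// Hypothetical helper: tally values like a text facet sorted by count,
// adding each value's share of the total.
function facetCounts(values) {
  const counts = new Map();
  values.forEach(function (v) {
    counts.set(v, (counts.get(v) || 0) + 1);
  });
  return Array.from(counts, function (entry) {
    return { value: entry[0], count: entry[1], share: entry[1] / values.length };
  }).sort(function (a, b) { return b.count - a.count; });
}

// Invented organization column, for illustration only.
console.log(facetCounts(['Host A', 'Host B', 'Host A', 'Host C', 'Host A', 'Host B']));
```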

For larger data sets, it’s useful to click Cluster and merge similarly-named organizations.
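
To show the kind of merging this does, here is a simplified sketch of key-collision clustering with a "fingerprint" key, the idea behind OpenRefine's default clustering method. It is a rough approximation (no accent folding, for instance), and the function names are hypothetical.

```javascript
// Simplified fingerprint: lowercase, drop punctuation, then sort and
// de-duplicate the remaining words.
function fingerprint(name) {
  const tokens = String(name)
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .split(/\s+/)
    .filter(Boolean);
  return Array.from(new Set(tokens)).sort().join(' ');
}

// Group names that share a fingerprint; only groups with more than one
// member are candidates for merging.
function clusterNames(names) {
  const clusters = new Map();
  names.forEach(function (name) {
    const key = fingerprint(name);
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(name);
  });
  return Array.from(clusters.values()).filter(function (group) {
    return group.length > 1;
  });
}

console.log(clusterNames(['Amazon.com, Inc', 'Amazon.com Inc.', 'Rackspace Hosting']));
```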

Our analysis also included a traceroute to each startup URL and analysis of each IP along the way, achieved with a modified version of the above Node.js script, but for simple hosting detection this is overkill.

We’re currently working on getting one of the large hosting companies on board to share the total number of Australian startups they have as customers. With this, we can finally provide a good estimate of the true size of what we can see developing around us.

This is just one small aspect of our analysis, and we’re very excited about what we’ve found so far. We’re looking forward to releasing the Startup Muster report soon, along with our final anonymized data set.

Click here to be notified of news on Startup Muster, or follow @Murray.