(Updated 14 February 2023).
Pulling Google Search Console (GSC) data for a site through the Search Console application programming interface (API), instead of using the web user interface (UI) directly, has become very popular. It’s easy to understand why. Making the keywords for which a site ranks available in BigQuery for Python code to use lets you build a broad range of transformational SEO applications on top that would be impossible without access to the raw data.
The GSC UI shows benchmark statistics for a site that are neither sampled nor aggregated. However, pulling back keyword data in the UI is restricted to 1,000 records. That’s one reason so many people choose to use the Search Console API. GSC is one of the most popular SEO ingredients. If you’re looking for a Google SEO data API, the GSC API has it all. However, there are widespread misunderstandings around Google Search Console sampling and GSC API limits. How can you get full Google Search Console bulk data into BigQuery?
Luckily, we’ve been testing this out at Similar.ai for more than a year. We’ve found the Search Console limits, and how you can export Google Search Console data from the API in bulk into BigQuery. Read on to learn from our experiments and experience.
Yes, both the Search Console UI and the Google Search Console API limit the data you can get back. For the UI that’s 1,000 records. The same dataset accessed via the GSC API suffers loss due to aggregation, the long-tail data gets clipped, and some personally identifiable information (PII) is omitted. For many smaller sites, you can get nearly all the keywords, impressions and clicks through the API. However, the larger a site gets, the more this drops off. These are real Google Search Console API limits that can hurt your results. For instance, you can only get back 50,000 Search Console page-keyword pairs per GSC property per day.
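To make the row limits concrete, here is a minimal sketch of paging through page-keyword rows for one property. It assumes the Search Analytics request fields `startDate`, `endDate`, `dimensions`, `rowLimit` and `startRow`, and a response with a `rows` list. The `run_query` callable and the `fetch_all_rows` name are our own illustration, not part of any library; with the google-api-python-client you would typically wrap `service.searchanalytics().query(siteUrl=..., body=body).execute()`.

```python
from typing import Callable, Dict, List

# Assumed maximum rowLimit for a single Search Analytics request.
PAGE_SIZE = 25_000


def fetch_all_rows(
    run_query: Callable[[Dict], Dict], start_date: str, end_date: str
) -> List[Dict]:
    """Page through page-query rows until the API returns a short or empty page.

    `run_query` is a hypothetical callable that sends one request body and
    returns the parsed JSON response, for example:
        lambda body: service.searchanalytics().query(
            siteUrl=SITE_URL, body=body).execute()
    """
    rows: List[Dict] = []
    start_row = 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["page", "query"],
            "rowLimit": PAGE_SIZE,
            "startRow": start_row,
        }
        page = run_query(body).get("rows", [])
        rows.extend(page)
        # A short page means the API has no more rows to give for this property.
        if len(page) < PAGE_SIZE:
            break
        start_row += len(page)
    return rows
```

Passing the query function in as an argument keeps the paging logic testable without credentials. However many pages you request, the per-property ceiling described above still applies.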
One product SEO team we talked to said that that wasn’t a problem since very few of their pages after the first 1,000 had any clicks or impressions. This is a common misunderstanding: the data you get back isn’t the truth, it’s just a piece of the truth. That’s the nature of sampling. That site is ranking for lots more keywords, with many of those pages getting just one or two impressions and no clicks. However, they assume it’s not true because search console never talks about this. In fact, what they are seeing is mostly an artefact of the search console limits. If you’re wondering why your page doesn’t show for any keywords in Google’s best page ranking API, we may have found the reason.
I’m Robin Allenson. I’m the co-founder & CEO at Similar.ai. I’ve been working with my SEO automation company on original research on how to get the most Google keywords out of the Google Search Console API. We tested as many different ways to get data out of the GSC API as we could. Finally, we found a way to measure what Google isn’t telling you: the Search Console Sampling Gap. Then, magically, we used that to test different approaches to getting all the SEO data by API: all the Search Console keywords you’ve been missing, all the clicks and all the impressions. We’ve found a simple way to get bulk data out of Google Search Console and export it into BigQuery. We can help you do the same, effectively removing Google Search Console API limits.
The Search Console Sampling Gap is the difference between the clicks or impressions from keyword-page pairs from the API and the site level data. We can measure the gap by adding up the clicks and impressions of the page-keyword pairs we get from the API and comparing these to the benchmark statistics for the site that you’ll see in the GSC UI. We call this ratio the GSC Sampling Rate. The missing clicks or impressions are the GSC Sampling Gap. For large sites, the impressions sampling gap is typically around 66%. That’s huge! The bigger the site, the more of a problem this is.
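The calculation can be sketched in a few lines of Python. This is only an illustration of the definition above: the function names are ours, the rows are assumed to be dictionaries with `clicks` and `impressions` fields as returned by the API, and the site-level total is assumed to be read from the GSC UI.

```python
from typing import Dict, Iterable


def sampling_rate(rows: Iterable[Dict], site_total: float, metric: str = "impressions") -> float:
    """Share of the site-level total that the page-keyword rows account for."""
    if site_total <= 0:
        raise ValueError("site_total must be positive")
    retrieved = sum(row[metric] for row in rows)
    return retrieved / site_total


def sampling_gap(rows: Iterable[Dict], site_total: float, metric: str = "impressions") -> float:
    """Share of the site-level total that is missing from the API rows."""
    return 1.0 - sampling_rate(rows, site_total, metric)


# Made-up numbers: the API rows add up to 340 impressions,
# while the site-level benchmark in the UI shows 1,000.
example_rows = [
    {"keys": ["/a", "red shoes"], "clicks": 3, "impressions": 200},
    {"keys": ["/b", "blue shoes"], "clicks": 1, "impressions": 100},
    {"keys": ["/c", "green shoes"], "clicks": 0, "impressions": 40},
]
print(f"{sampling_gap(example_rows, 1000):.0%}")  # prints 66%
```

The same functions work for clicks by passing `metric="clicks"` together with the site-level click total.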
What happens if this is not just a large site, but a huge site? It turns out the differences are remarkably consistent between what you get from the Google Search Console API and a full bulk data export of Google Search Console keywords. Google sampling limits your understanding of your own site.
We think that the reason that this is consistent between two different sizes of enterprise site is that the data retrieved for the site profile is a measure of how big the fat tail is. However, I’ve personally spoken to product-led growth teams who have single-digit percentage sample rates: they hit these SEO keyword limits more than 90% of the time. They need their enterprise SEO platform to scale, but it’s hard to do that when there’s no data in their Search Console: they were flying blind until they partnered with Similar.ai to get full Search Console data in a bulk export.
How much of a problem missing data from the GSC API is depends on the downstream applications you’re using it for. If you’re primarily using GSC data for analytics and intelligence, maybe not much. However, if you want to use GSC data for SEO automation, such as cleaning up pages in which users won’t get a great experience, making use of internal linking opportunities to money pages or creating new pages for user needs a site is uniquely placed to answer, data gaps become critical. That’s for a couple of important reasons.
Some large sites have a directory structure that makes Search Console data sampling very hit or miss. Missing data is not spread evenly throughout the site; instead, there is no data at all for specific pages or categories. Say that one of your downstream product SEO goals is to use GSC data to identify relevant topics for which your site ranks poorly and lacks a dedicated category page. With this folder structure and the way the GSC API works for enterprise sites, there will be categories for which you miss all impression data and can never generate new page recommendations. If one of these categories is in the 5% of categories on the site which drive all your revenue, that’s a huge problem.
The folder structure of many enterprise sites does not explicitly mention their higher level categories in the directory path for lower level categories. This is very common. Leaving higher level categories out of the URLs of deeper categories probably happens so often because so many sites nowadays are huge and so many types of enterprise sites have very deep site structures. Deep site structures are common in classifieds, affiliate sites, travel sites, real estate sites, eCommerce sites and marketplaces. Perhaps not mentioning higher level categories helps keep the URLs short enough when a site has thousands of categories.
For instance, Farfetch’s clutch category is https://www.farfetch.com/shopping/women/clutches-1/items.aspx even though the breadcrumbs are for Women > Bags > Clutch Bags. This means if you were to take the standard practice of adding properties for the main top level categories such as https://www.farfetch.com/shopping/women/, this will miss a lot of the data for the hundreds of directories underneath it in the folder structure. Here it’s not just a subtle Google Search API limit, it really means you’re going to be seeing no data from the GSC API when those pages are ranking and you should see GSC keywords. If your site suffers from this kind of deep structure, adding an internal linking structure around the needs of the user will help flatten it to grow your active pages.
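A small sketch shows why the breadcrumb hierarchy cannot be relied on when choosing properties. It assumes that a URL-prefix property only covers URLs that start with that prefix. The `covering_properties` helper and the `/bags/` property below are hypothetical, used purely for illustration.

```python
from typing import List


def covering_properties(url: str, property_prefixes: List[str]) -> List[str]:
    """Return the URL-prefix properties whose prefix matches the given URL."""
    return [prefix for prefix in property_prefixes if url.startswith(prefix)]


clutch_url = "https://www.farfetch.com/shopping/women/clutches-1/items.aspx"
properties = [
    "https://www.farfetch.com/shopping/women/",
    # Hypothetical property that follows the breadcrumb (Women > Bags),
    # not the real URL path.
    "https://www.farfetch.com/shopping/women/bags/",
]

# Only the broad /women/ prefix matches; the breadcrumb-based prefix does not.
print(covering_properties(clutch_url, properties))
```

Because the clutch URL does not contain its parent category, only the broad property matches it, and that property has to share its row quota with every other directory underneath it.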
We love sites with deep (or flat) site structures, because they are no issue for our bulk data export for GSC! If that’s your site or your client’s, let us know: we can still turn on the Google Search Console firehose.
The second place that missing GSC data through the API is a problem is when tracking the impact of product SEO changes. Growing organic revenue sits at the end of a long chain of causes and effects. We’ve talked to a lot of SEO teams who use their internal data analytics team to measure the impact of their initiatives. That makes sense as these teams can use a consistent measurement approach for all marketing initiatives. The number one approach we hear of is a causal impact analysis of traffic. This analysis requires the experiment to run for months to gather sufficient data, the analysis itself takes weeks, and the majority of outcomes are inconclusive.
Different product SEO experiments are aimed at solving different pieces of the SEO puzzle. Optimising a page by adding meta-data content or adding semantic mark-up is likely to improve the CTR. Adding frequently-asked questions to a long-tail transactional page is likely to improve the number of keywords for which it ranks and that influx of new keywords is likely to reduce the average position. Internal linking, such as boosting links to pages which have traffic opportunity, can improve the position and have an outsize effect on growing organic revenue. Each of these can often be measured in days or weeks with solid GSC data. This bulk data export is both “more robust and more sensitive” in the words of the head of analytics & machine learning for a huge enterprise site.
Just as importantly: it is a lot faster. The time to learn from SEO experimentation automation is decimated. For product-led teams, this ability to learn quickly from multiple experiments, to learn what works in your vertical and each geo, is a superpower. And it all stems from having accurate, granular and comprehensive GSC data as a bulk data export for every page and category on your site. Not having that data can leave you significantly slower at learning what works than your organic competition.
The Google Search Console UI has a limit of 1,000 rows, but retrieving data through the API is subject to a number of quotas. Some of these are at site level, others at account level and others at property level. You can only get back 50,000 pairs of pages & keywords per Search Console property each day. Adding new Search Console properties means that it’s possible to get more data back through the API. We use our technique of measuring the sampling gap to track how much the click gap, impressions gap or keywords gap shrinks as we add more properties. Just how much can we reduce the gap?
It turns out that if we know which properties to add, we can already dramatically reduce the gap. By carefully selecting 50 properties and then adding them to our example site, we reduced the impressions lost from retrieving data through the API six-fold from a 67% gap down to just 11%.
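As a rough sketch of how rows from several properties can be combined and the gap re-measured, consider the code below. It rests on an assumption of ours, not a documented guarantee: when a sub-directory property and its parent both return the same page-keyword pair, the pair should be counted once, and we keep the row with more impressions. The function names and sample numbers are illustrative only.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, ...]


def merge_property_rows(per_property_rows: Iterable[Iterable[dict]]) -> Dict[Key, dict]:
    """Combine rows from several properties, counting each (page, query) pair once."""
    merged: Dict[Key, dict] = {}
    for rows in per_property_rows:
        for row in rows:
            key = tuple(row["keys"])  # (page, query)
            best = merged.get(key)
            # Assumption: on overlap, keep the row reporting more impressions.
            if best is None or row["impressions"] > best["impressions"]:
                merged[key] = row
    return merged


def impressions_gap(merged: Dict[Key, dict], site_impressions: float) -> float:
    """Share of site-level impressions still missing after merging."""
    retrieved = sum(row["impressions"] for row in merged.values())
    return 1.0 - retrieved / site_impressions


# Made-up rows: one root property and one sub-directory property.
root_rows = [
    {"keys": ["/a", "x"], "impressions": 100},
    {"keys": ["/b", "y"], "impressions": 50},
]
sub_rows = [
    {"keys": ["/b", "y"], "impressions": 50},   # overlaps with the root property
    {"keys": ["/b/c", "z"], "impressions": 30},  # only visible via the sub-property
]

gap_before = impressions_gap(merge_property_rows([root_rows]), 300)
gap_after = impressions_gap(merge_property_rows([root_rows, sub_rows]), 300)
print(f"gap before: {gap_before:.0%}, gap after: {gap_after:.0%}")
```

Re-running the gap calculation after each added property is one way to see which properties are worth their quota. For the example site in the text, going from a 67% gap to an 11% gap is 0.67 / 0.11, or roughly a six-fold reduction.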
That’s amazing and amazingly useful for product-led SEO teams, especially those with deep site structures. But how far can we take this approach? It turns out that there is a law of diminishing returns because of the enormous long tail.
The more profiles we add, the fewer impressions and clicks we get in return. More and more sub-directories yield less and less data to shrink the API gap. Although this process is slow and tedious, it’s important, as you can build up nearly all the first-party search engine data for an enterprise.
The Similar.ai platform is able to get nearly all the data you can get from Google (as we noted before, there is some data that Google will always withhold) for any site, no matter the size.
There is another amazing benefit. I’ve spoken to tens of product-led growth teams and almost all of them are unaware of this. Although each additional profile shrinks the impression or click sampling gap by less, the number of keywords increases dramatically. Both of these are because of the long tail. For a recent customer, we increased the number of keywords we retrieved by 13.7x. You too can close the keyword gap, which is basically closing the SEO gap for many product-led growth companies.
One great application of using this order of magnitude improvement in the number of keywords you get for your site is to build AI for SEO. In AI, more training data means a better quality result. You can use all those new keywords to improve the quality of machine learning classifiers to disambiguate keywords and for many other enterprise SEO product features.
Curious about what Search Console limits mean for you in practice and just how much GSC data you miss? We can measure that for your site. If you’d like to skip the queue and get a dependable bulk data export for Google Search Console from the experts in the field, please reach out, and we’ll set you up!
We’d also love to swap notes on how leading product growth teams use GSC data for internal linking, clean-up, new page creation, automated content generation or experimentation & measurement. There is nothing we like better than to chat about all the problems, struggles and successes product-led SEO entails.
We’re working with clients around the world to maximise their SEO data ingredients. Our platform integrates first-party and third-party data; plays nicely with crawlers, log file analysers and keyword data suppliers; automating recipes & keeping everything up-to-date; connecting pages data with SERP data, keyword data, topic & knowledge graph data; automating conclusions about what to do with every page on your site; all to let you experiment and scale your SEO initiatives.