Building a Dual Store Scraper: How I Automated App Market Research 🔍

When Curiosity Meets Automation 🚀

Back in 2020, I found myself repeatedly looking up apps and their developers across both the App Store and Play Store for market research. Manual searches were time-consuming and inefficient, so I decided to automate the process. The result? A Node.js-based scraper that not only collects app data but also finds developer contact information using Hunter.io's API.

Technical Architecture

Core Components

The project is split into two main modules:

  • App Store scraper (index-as.js)
  • Play Store scraper (index-ps.js)

Here's how the App Store module is structured:

javascript
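const store = require("app-store-scraper");
const parseDomain = require("parse-domain");
// Assumption: createCsvWriter comes from the csv-writer package
const createCsvWriter = require("csv-writer").createObjectCsvWriter;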

// CSV Writer Configuration
const csvWriter = createCsvWriter({
  path: "doc/appstore/[TEST][TOP_FREE_IOS][LIFESTYLE]app-store-scrapper.csv",
  header: [
    { id: "appName", title: "App Name" },
    { id: "publisherName", title: "Publisher Name" },
    // ... other headers
  ],
});
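
To make the header mapping concrete, here is a dependency-free sketch of how `id`/`title` pairs can turn records into CSV text. It is not how the CSV library is implemented; `escapeCsvField` and `toCsv` are hypothetical helper names, and the sample record is invented for illustration.

```javascript
// Hypothetical helpers illustrating the id -> title header mapping.
// Quote a field when it contains a comma, a double quote, or a line break.
function escapeCsvField(value) {
  const text = value === undefined || value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build the title row from header titles, then one row per record using header ids.
function toCsv(header, records) {
  const titleRow = header.map((column) => escapeCsvField(column.title)).join(",");
  const dataRows = records.map((record) =>
    header.map((column) => escapeCsvField(record[column.id])).join(",")
  );
  return [titleRow, ...dataRows].join("\n") + "\n";
}

const header = [
  { id: "appName", title: "App Name" },
  { id: "publisherName", title: "Publisher Name" },
];

// Invented sample record, for illustration only.
const csv = toCsv(header, [
  { appName: "Example, Inc. App", publisherName: "Example" },
]);
console.log(csv);
```

The app name contains a comma, so it is wrapped in quotes in the output while the publisher name is written as-is.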

Smart Email Discovery 🎯

One of the most interesting challenges was extracting developer emails. I implemented two approaches:

  1. Direct extraction from app descriptions:
javascript
function extractEmails(text) {
  // match() returns null when nothing is found, so fall back to an empty array
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi) || [];
}
  2. Domain-based discovery using Hunter.io:
javascript
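// Assumption: HunterSDK is the client exported by the hunterio package
const HunterSDK = require("hunterio");
const hunter = new HunterSDK(process.env.HUNTER_API_KEY);

// Domain parsing and email discovery
if (app.developerWebsite) {
  // parseDomain returns null when it cannot parse the URL
  const parsedDomain = parseDomain(app.developerWebsite);

  if (parsedDomain) {
    // Drop any subdomain and query Hunter.io with the registrable domain only
    const realDomainFromPublisherDomain =
      parsedDomain.domain + "." + parsedDomain.tld;

    hunter.domainSearch({
      domain: realDomainFromPublisherDomain,
    }, function(err, body) {
      // Process results
    });
  }
}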
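
One way the two approaches could be combined is sketched below: look in the description first and fall back to a domain lookup only when nothing is found. This is an illustration, not the project's actual flow. The `app.description` and `app.domain` fields and the injected `lookupByDomain` function (a stand-in for the Hunter.io call) are assumptions. The sketch also cleans the matches, because the pattern can swallow the full stop that ends a sentence.

```javascript
// Same character classes as the extraction pattern shown earlier.
const EMAIL_PATTERN = /[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/g;

// Strip trailing punctuation the pattern can capture, lowercase, and deduplicate.
function extractUniqueEmails(text) {
  const matches = (text || "").match(EMAIL_PATTERN) || [];
  const cleaned = matches.map((email) =>
    email.replace(/[._-]+$/, "").toLowerCase()
  );
  return [...new Set(cleaned)];
}

// Try the description first; call the injected lookup only when needed.
// `lookupByDomain` is a hypothetical async function returning an array of emails.
async function findContactEmails(app, lookupByDomain) {
  const fromDescription = extractUniqueEmails(app.description);
  if (fromDescription.length > 0) {
    return { source: "description", emails: fromDescription };
  }
  if (!app.domain) {
    return { source: "none", emails: [] };
  }
  return { source: "domain", emails: await lookupByDomain(app.domain) };
}

// Usage with a stubbed lookup, so no API key or network access is required.
const stubLookup = async (domain) => [`contact@${domain}`];
findContactEmails({ description: "No address here", domain: "example.com" }, stubLookup)
  .then((result) => console.log(result));
```

Checking the description first avoids spending Hunter.io quota on apps that already publish an address.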

Handling Store Differences

The App Store and Play Store have different data structures and APIs. Here's how I handled it:

App Store Implementation

javascript
store.list({
  collection: store.collection.TOP_FREE_IOS,
  category: store.category.LIFESTYLE,
  num: 200,
  fullDetail: true,
})
.then(function(apps) {
  apps.forEach((app) => {
    // Process app data
  });
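})
.catch(function(err) {
  // Surface network or parsing failures instead of dropping them silently
  console.error("App Store request failed:", err);
});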

Play Store Implementation

javascript
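// Assumption: gplay is the google-play-scraper package
const gplay = require("google-play-scraper");

gplay.list({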
  category: gplay.category.SPORTS,
  collection: gplay.collection.TRENDING,
  num: 500,
  country: "fr",
  fullDetail: true,
  lang: "fr",
  price: "free",
})
.then(function(apps) {
  // Similar processing logic
});

Challenges and Solutions 💪

Rate Limiting

Both stores and Hunter.io enforce rate limits. I handled them with batch processing and a fixed pause between requests; the delay loop looks like this:

javascript
// Pause between requests to avoid rate limiting
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function processApps(apps) {
  for (const app of apps) {
    await processApp(app);
    await delay(2000); // 2 second delay between requests
  }
}
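
The loop above handles one app at a time. As a rough illustration of the batching side, the sketch below runs a small group of requests concurrently and pauses between groups. The source does not show how batching was done, so `chunk`, `processInBatches`, the batch size, and the pause length are all hypothetical.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `worker` on every item of a batch concurrently, pausing between batches.
async function processInBatches(items, size, pauseMs, worker) {
  const results = [];
  const batches = chunk(items, size);
  for (const [index, batch] of batches.entries()) {
    results.push(...(await Promise.all(batch.map(worker))));
    // No pause is needed after the final batch.
    if (index < batches.length - 1) {
      await delay(pauseMs);
    }
  }
  return results;
}

// Usage with a stand-in worker; a real worker would call the store or Hunter.io.
processInBatches(["a", "b", "c"], 2, 10, async (id) => id.toUpperCase())
  .then((results) => console.log(results));
```

Smaller batches with longer pauses are gentler on rate limits at the cost of total run time.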

Data Consistency

Store APIs return different data structures. I normalized them into a consistent format:

javascript
const normalizedData = {
  appName: app.title,
  publisherName: app.developer,
  publisherDomain: app.developerWebsite,
  // ... other normalized fields
};
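
A single function can apply that mapping to results from either store. The sketch below is an assumption-laden illustration: the field names (`title`, `developer`, `developerWebsite`, `developerEmail`) should be checked against the scraper versions in use, and the `store` labels and sample records are invented.

```javascript
// Map a raw scraper result to one consistent row shape.
// Missing fields become empty strings so every CSV row has the same columns.
function normalizeApp(app, storeName) {
  return {
    store: storeName, // hypothetical label, e.g. "appstore" or "playstore"
    appName: app.title || "",
    publisherName: app.developer || "",
    publisherDomain: app.developerWebsite || "",
    // Assumption: only some results carry a developer email field.
    publisherEmail: app.developerEmail || "",
  };
}

// Invented sample records, for illustration only.
const rows = [
  normalizeApp({ title: "Yoga Daily", developer: "Example Ltd" }, "appstore"),
  normalizeApp(
    {
      title: "Foot Live",
      developer: "Sample SAS",
      developerWebsite: "https://sample.example",
      developerEmail: "hello@sample.example",
    },
    "playstore"
  ),
];
console.log(rows);
```

Defaulting missing values to empty strings keeps downstream code from having to check for `undefined` on every field.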

Future Improvements 🔮

Looking back at this 2020 project, here are some improvements I'd make today:

  1. Add TypeScript for better type safety
  2. Implement retry mechanisms for failed requests
  3. Add proxy support for better rate limit handling
  4. Create a configuration file for easier customization
  5. Add support for modern app store features
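
As an illustration of the retry item, here is a minimal sketch of retrying with exponential backoff. It is not part of the original project; `backoffDelay`, `withRetry`, and the default values are hypothetical.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wait time before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function backoffDelay(attempt, baseMs) {
  return baseMs * 2 ** (attempt - 1);
}

// Call `task`, retrying up to `maxRetries` times after failures.
// The last error is rethrown once the retries are used up.
async function withRetry(task, { maxRetries = 3, baseMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxRetries) {
        throw error;
      }
      await delay(backoffDelay(attempt + 1, baseMs));
    }
  }
}

// Usage with a stand-in task that fails twice before succeeding.
let calls = 0;
const flakyTask = async () => {
  calls += 1;
  if (calls < 3) {
    throw new Error("temporary failure");
  }
  return "ok";
};
withRetry(flakyTask, { maxRetries: 3, baseMs: 10 }).then((value) => console.log(value, calls));
```

Doubling the wait after each failure gives a rate-limited service progressively more time to recover.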

Technical Learnings 📚

This project taught me valuable lessons about:

  1. API integration and rate limiting
  2. Data normalization across different sources
  3. Regular expressions for email extraction
  4. Environment variable management
  5. CSV data export
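
On the environment variable point, a small guard can fail fast when a key such as `HUNTER_API_KEY` is missing instead of sending requests with an undefined key. This is a sketch rather than code from the project, and `requireEnv` is a hypothetical helper.

```javascript
// Read a required environment variable, failing fast with a clear message.
// `env` defaults to process.env and can be replaced in tests.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
}

// Usage with an injected environment; the key value is a placeholder.
const apiKey = requireEnv("HUNTER_API_KEY", { HUNTER_API_KEY: " placeholder-key " });
console.log(apiKey.length);
```

In the scraper itself the call would be `requireEnv("HUNTER_API_KEY")`, reading from the real process environment.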

Conclusion

While this 2020 project needs some updates to work with current store APIs, it demonstrates the power of automation in market research. The core concepts of data scraping, API integration, and information normalization remain relevant today.

Check out the complete source code on GitHub if you're interested in learning more or contributing to the project's evolution.