When Curiosity Meets Automation 🚀
Back in 2020, I found myself repeatedly looking up apps and their developers across both the App Store and Play Store for market research. Manual searches were time-consuming and inefficient, so I decided to automate the process. The result? A Node.js-based scraper that not only collects app data but also finds developer contact information using Hunter.io's API.
Technical Architecture
Core Components
The project is split into two main modules:
- App Store scraper (index-as.js)
- Play Store scraper (index-ps.js)
Here's how the App Store module is structured:
const store = require("app-store-scraper");
const parseDomain = require("parse-domain");
// createCsvWriter is assumed to come from the csv-writer package
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
// CSV Writer Configuration
const csvWriter = createCsvWriter({
  path: "doc/appstore/[TEST][TOP_FREE_IOS][LIFESTYLE]app-store-scrapper.csv",
  header: [
    { id: "appName", title: "App Name" },
    { id: "publisherName", title: "Publisher Name" },
    // ... other headers
  ],
});
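To make the header configuration a bit more concrete, here's a small dependency-free sketch of how each `id` picks a field from a record and each `title` becomes a column heading. It is only an illustration of the mapping, not how the CSV library works internally; `escapeCsvField`, `toCsv`, and the sample record are hypothetical names and data I made up for the example.

```javascript
// Hypothetical helper: quotes a field when it contains commas, quotes, or line breaks.
function escapeCsvField(value) {
  const text = value === undefined || value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Hypothetical helper: renders records as CSV text using { id, title } headers.
function toCsv(header, records) {
  const titleRow = header.map((h) => escapeCsvField(h.title)).join(",");
  const dataRows = records.map((record) =>
    // Each header id selects the matching key on the record; missing keys become empty cells.
    header.map((h) => escapeCsvField(record[h.id])).join(",")
  );
  return [titleRow, ...dataRows].join("\n");
}

const header = [
  { id: "appName", title: "App Name" },
  { id: "publisherName", title: "Publisher Name" },
];

// Sample record invented for illustration.
const csv = toCsv(header, [
  { appName: "Example, Inc. App", publisherName: "Example Studio" },
]);
console.log(csv);
```

Keys on a record that have no matching header `id` are simply ignored in this sketch, which is why the header list acts as the single place that defines the output columns.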
Smart Email Discovery 🎯
One of the most interesting challenges was extracting developer emails. I implemented two approaches:
- Direct extraction from app descriptions:
function extractEmails(text) {
  // String.prototype.match returns null when nothing matches, so fall back to an empty array
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi) || [];
}
- Domain-based discovery using Hunter.io:
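// HunterSDK is assumed to be imported from the Hunter.io client package the project uses
const hunter = new HunterSDK(process.env.HUNTER_API_KEY);
// Domain parsing and email discovery
if (app.developerWebsite) {
  // parseDomain returns null when it cannot recognize the URL
  const parsedDomain = parseDomain(app.developerWebsite);
  if (parsedDomain) {
    const realDomainFromPublisherDomain =
      parsedDomain.domain + "." + parsedDomain.tld;
    hunter.domainSearch({
      domain: realDomainFromPublisherDomain,
    }, function(err, body) {
      // Process results
    });
  }
}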
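Here's a rough sketch of how the two approaches can be combined: use emails found in the description when there are any, and only fall back to a domain lookup otherwise. Everything below is an assumption made for illustration, not the project's actual code. The names `extractUniqueEmails`, `guessDomain`, and `planContactLookup` are hypothetical, the regex is a slightly stricter variant of the one above, and the domain guess uses Node's built-in `URL` instead of parse-domain.

```javascript
// Simple email pattern; the final label is letters only so a trailing "." is not captured.
const EMAIL_PATTERN =
  /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}/g;

// Hypothetical helper: returns lowercased, de-duplicated emails found in the text.
function extractUniqueEmails(text) {
  const matches = (text || "").match(EMAIL_PATTERN) || [];
  return [...new Set(matches.map((email) => email.toLowerCase()))];
}

// Hypothetical helper: naive domain guess that keeps the last two hostname labels.
// This is wrong for multi-part suffixes such as "co.uk", which is what a
// dedicated domain parser handles properly.
function guessDomain(website) {
  try {
    const host = new URL(website).hostname;
    return host.split(".").slice(-2).join(".");
  } catch (err) {
    return null; // not a valid absolute URL
  }
}

// Hypothetical helper: decides which discovery route to take for one app.
function planContactLookup(app) {
  const emails = extractUniqueEmails(app.description);
  if (emails.length > 0) return { source: "description", emails };
  const domain = app.developerWebsite ? guessDomain(app.developerWebsite) : null;
  return domain ? { source: "domain-search", domain } : { source: "none" };
}

// Sample apps invented for illustration.
console.log(planContactLookup({ description: "Write to support@example.com." }));
console.log(planContactLookup({ developerWebsite: "https://www.example.com/contact" }));
```

Checking the description first keeps paid API lookups for the apps that actually need them.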
Handling Store Differences
The App Store and Play Store have different data structures and APIs. Here's how I handled it:
App Store Implementation
store.list({
  collection: store.collection.TOP_FREE_IOS,
  category: store.category.LIFESTYLE,
  num: 200,
  fullDetail: true,
})
  .then(function(apps) {
    apps.forEach((app) => {
      // Process app data
    });
  });
Play Store Implementation
const gplay = require("google-play-scraper");
gplay.list({
  category: gplay.category.SPORTS,
  collection: gplay.collection.TRENDING,
  num: 500,
  country: "fr",
  fullDetail: true,
  lang: "fr",
  price: "free",
})
  .then(function(apps) {
    // Similar processing logic
  });
Challenges and Solutions 💪
Rate Limiting
Both stores and Hunter.io have rate limits. I implemented delays and batch processing:
// Pause between requests to avoid rate limiting
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function processApps(apps) {
  for (const app of apps) {
    await processApp(app);
    await delay(2000); // 2 second delay between requests
  }
}
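The loop above handles one app at a time. For the batch side of things, here's a minimal sketch of one way to do it: split the list into small groups, run each group concurrently, and pause between groups. The names `chunk` and `processInBatches`, and the default batch size and pause, are hypothetical choices for this example rather than values from the original project.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Splits an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Runs `worker` on every item of a batch concurrently, then pauses before the next batch.
// Returns Promise.allSettled-style results so one failed app does not abort the run.
async function processInBatches(items, worker, { batchSize = 5, pauseMs = 2000 } = {}) {
  const results = [];
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    const settled = await Promise.allSettled(
      batches[i].map(async (item) => worker(item))
    );
    results.push(...settled);
    // No pause is needed after the final batch.
    if (i < batches.length - 1) await delay(pauseMs);
  }
  return results;
}
```

With the function from the snippet above, a call might look like `await processInBatches(apps, processApp, { batchSize: 5, pauseMs: 2000 })`. Smaller batches are gentler on rate limits, larger ones finish sooner.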
Data Consistency
Store APIs return different data structures. I normalized them into a consistent format:
const normalizedData = {
  appName: app.title,
  publisherName: app.developer,
  publisherDomain: app.developerWebsite,
  // ... other normalized fields
};
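As a sketch of how that mapping could be wrapped in a reusable function, the example below tags each record with its store and turns missing or blank fields into `null` so every row has the same shape. The helper names `firstNonEmpty` and `normalizeApp`, the `store` and `publisherEmail` fields, and the assumption that some results carry a `developerEmail` property are all mine, added for illustration.

```javascript
// Hypothetical helper: returns the first non-blank string, trimmed, or null if none exists.
function firstNonEmpty(...values) {
  const found = values.find((v) => typeof v === "string" && v.trim() !== "");
  return found === undefined ? null : found.trim();
}

// Hypothetical helper: maps a raw store result onto one consistent record shape.
function normalizeApp(app, storeName) {
  return {
    store: storeName,
    appName: firstNonEmpty(app.title),
    publisherName: firstNonEmpty(app.developer),
    publisherDomain: firstNonEmpty(app.developerWebsite),
    // Assumption: only some store results expose a developer email directly.
    publisherEmail: firstNonEmpty(app.developerEmail),
  };
}

// Sample raw results invented for illustration.
const fromAppStore = normalizeApp({ title: " Demo ", developer: "Acme" }, "appstore");
const fromPlayStore = normalizeApp(
  { title: "Demo", developer: "Acme", developerEmail: "dev@example.com" },
  "playstore"
);
console.log(fromAppStore, fromPlayStore);
```

Using `null` for absent values makes gaps explicit in the exported CSV instead of leaving `undefined` to be handled differently by each consumer.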
Future Improvements 🔮
Looking back at this 2020 project, here are some improvements I'd make today:
- Add TypeScript for better type safety
- Implement retry mechanisms for failed requests
- Add proxy support for better rate limit handling
- Create a configuration file for easier customization
- Add support for modern app store features
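For the retry item in the list above, here's a minimal sketch of what I have in mind: retry a failed request a few times, doubling the wait after each failure. The function name `withRetry` and its `retries` and `baseDelayMs` options are hypothetical, and a real version would probably only retry on errors that are worth retrying.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls `task` until it succeeds or the retries are used up.
// Waits baseDelayMs, then 2x, then 4x, and so on between attempts.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is coming.
      if (attempt < retries) await delay(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Any promise-returning call can be wrapped this way, for example `await withRetry(() => processApp(app))` with the `processApp` function from the rate limiting section.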
Technical Learnings 📚
This project taught me valuable lessons about:
- API integration and rate limiting
- Data normalization across different sources
- Regular expressions for email extraction
- Environment variable management
- CSV data export
Conclusion
While this 2020 project needs some updates to work with current store APIs, it demonstrates the power of automation in market research. The core concepts of data scraping, API integration, and information normalization remain relevant today.
Check out the complete source code on GitHub if you're interested in learning more or contributing to the project's evolution.