Smartphones Cleaned Dataset (India, 2025)

📱 Project Overview

This repository contains the complete workflow behind building a clean, structured, and analysis-ready smartphone dataset for the Indian market.
The project starts from raw scraped HTML and unprocessed CSV files and ends with a fully curated dataset of 758 smartphones, suitable for data analysis and machine learning.

The goal of this project is to demonstrate real-world data engineering and data cleaning skills, not just model building.

🎯 What This Project Solves

Public smartphone data available online is often:

Inconsistent in format and naming conventions
Text-heavy and noisy with unstructured specifications
Contaminated with feature phones and irrelevant entries
Hard to use directly for analysis or machine learning

This project transforms that raw data into a reliable, machine-readable dataset by applying systematic cleaning, validation, and feature engineering techniques.

📂 Repository Structure

smartphones-dataset/
├── LICENSE
├── readme.md
├── requirements.txt
│
├── data/
│   ├── raw/
│   │   └── smartphones.csv          # 1007 extracted phones
│   └── processed/
│       └── smartphones_cleaned.csv  # 758 cleaned phones
│
├── images/                        # Visualizations & assets
│
├── notebooks/
│   ├── 01_extract_phones.ipynb      # HTML → CSV extraction
│   ├── 02_cleaning.ipynb            # Data cleaning & feature engineering
│   └── 03_eda_preview.ipynb         # Exploratory data analysis
│
├── scraped/
│   └── smartprix_phones.html        # Raw scraped HTML
│
└── scripts/
    └── scrape.py                     # Selenium-based scraper

🧩 Data Pipeline Explained

1️⃣ Data Extraction → `01_extract_phones.ipynb`

Scraped raw smartphone listings from Smartprix.com (Indian marketplace)
Used Selenium with Chrome WebDriver to handle dynamic page loading
Automated infinite scrolling to load all phone listings
Parsed the saved HTML with BeautifulSoup to extract specifications
Extracted 1,007 phone entries with 11 raw features

Technology Stack:

selenium - Dynamic web scraping
beautifulsoup4 - HTML parsing
Custom CSS selectors for robust extraction

Files:
scripts/scrape.py → scraped/smartprix_phones.html → notebooks/01_extract_phones.ipynb → data/raw/smartphones.csv
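
The infinite-scroll step can be sketched as follows. This is a minimal illustration, not the code in scripts/scrape.py: the function name `scroll_to_bottom`, its parameters, and the `LISTING_URL` constant in the usage comments are hypothetical. The only WebDriver calls assumed are `execute_script` and `page_source`, both of which exist in Selenium.

```python
import time


def scroll_to_bottom(driver, pause_s=2.0, max_rounds=200):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give lazy-loaded listings time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded, so the end of the list was reached
        last_height = new_height
    return rounds


# Usage sketch (requires selenium and a Chrome installation):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(LISTING_URL)  # hypothetical constant holding the listing page URL
# scroll_to_bottom(driver)
# html = driver.page_source  # save this to scraped/smartprix_phones.html
```

The `max_rounds` cap is a safety net so that a page that keeps growing cannot loop forever.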

2️⃣ Data Cleaning → `02_cleaning.ipynb`

This is the most intensive phase, with more than 200 lines of cleaning logic.

Key Operations:

Removed 249 feature phones (low specs, old OS, missing processors)
Standardized brand names (OPPO → Oppo, SAMSUNG → Samsung)
Removed currency symbols and formatting from prices
Handled missing values systematically
Converted text-based specs into numeric values
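
Two of these operations, brand standardization and price cleaning, might look like the sketch below. It uses only the standard library for brevity; in the notebook the same logic would presumably be applied to pandas columns. The function names, the `BRAND_OVERRIDES` entries, and the assumption that prices are whole rupees with optional comma grouping are all illustrative, not taken from the repository.

```python
import re

# Hypothetical overrides for brands whose canonical spelling is not plain title case.
BRAND_OVERRIDES = {"iqoo": "iQOO", "oneplus": "OnePlus"}


def normalize_brand(raw):
    """Map variants such as 'OPPO' or ' samsung ' to one canonical spelling."""
    key = raw.strip().lower()
    return BRAND_OVERRIDES.get(key, key.title())


def parse_price_inr(raw):
    """Return the first number in the text as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", raw or "")  # digits with optional comma grouping
    return int(match.group(0).replace(",", "")) if match else None


print(normalize_brand("OPPO"))       # Oppo
print(parse_price_inr("₹1,29,999"))  # 129999
```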

Validation Rules Applied:

Battery capacity: at least 3000 mAh
Display size: larger than 2.8 inches
RAM: more than 32 MB
Valid processor required (Snapdragon, Dimensity, Helio, etc.)
Modern OS only (Android 4.0+, iOS)
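
As a rough sketch, the rules above could be combined into a single predicate. The record keys (`battery_mah`, `display_inches`, `ram_mb`, `processor`, `os`) and the treatment of Android entries without a version number are assumptions made for this example, and the processor list is only the subset named in this README.

```python
import re

# Processor families named in this README; the real list is likely longer.
KNOWN_PROCESSORS = ("snapdragon", "dimensity", "helio", "unisoc")


def os_is_modern(os_name):
    """Accept iOS, or Android with a major version of 4 or higher."""
    text = (os_name or "").lower()
    if re.search(r"\bios\b", text):  # word boundary so 'KaiOS' does not match
        return True
    match = re.search(r"android\s*v?(\d+)", text)
    # Assumption: an Android entry with no version number is rejected.
    return bool(match) and int(match.group(1)) >= 4


def is_smartphone(phone):
    """Apply the validation rules to one record (a dict with hypothetical keys)."""
    processor = (phone.get("processor") or "").lower()
    return (
        (phone.get("battery_mah") or 0) >= 3000
        and (phone.get("display_inches") or 0) > 2.8
        and (phone.get("ram_mb") or 0) > 32
        and any(name in processor for name in KNOWN_PROCESSORS)
        and os_is_modern(phone.get("os"))
    )
```

Keeping each rule as a separate condition makes it easy to log which rule rejected a given row when auditing the 249 removed entries.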

Files:
data/raw/smartphones.csv (1007 rows) → notebooks/02_cleaning.ipynb → data/processed/smartphones_cleaned.csv (758 rows)

3️⃣ Feature Engineering → `02_cleaning.ipynb`

Transformed messy text columns into 29 well-defined features:

From Text to Structured Data:

Processor: Split into processor_brand, processor_name, core_count, clock_speed_ghz
Camera: Extracted rear_camera_count, front_camera_count, rear_camera_main_mp, front_camera_main_mp
Display: Split into display_inches, res_width_px, res_height_px, refresh_rate_hz
Connectivity: Created boolean flags has_5g, has_vo5g, has_volte, has_nfc, has_ir_blaster
Battery & charging: Extracted battery_mah, charging_watt, fast_charging (boolean)
Memory & storage: Extracted ram_gb, storage_gb
Memory Card: Parsed memory_card_supported (boolean) and memory_card_type
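
The boolean columns can be derived by tokenizing the raw text rather than by substring search, which avoids counting "Vo5G" as "5G". The sketch below assumes a comma-separated connectivity string and memory card phrases such as "Memory Card Not Supported"; those input formats, the function names, and the `"hybrid"` / `"dedicated"` labels are assumptions, not the notebook's actual values.

```python
def connectivity_flags(text):
    """Turn a comma-separated connectivity string into boolean columns."""
    tokens = {part.strip().lower() for part in (text or "").split(",")}
    return {
        "has_5g": "5g" in tokens,
        "has_vo5g": "vo5g" in tokens,
        "has_volte": "volte" in tokens,
        "has_nfc": "nfc" in tokens,
        "has_ir_blaster": "ir blaster" in tokens,
    }


def memory_card_info(text):
    """Return (memory_card_supported, memory_card_type) from free text."""
    lowered = (text or "").lower()
    if not lowered or "not supported" in lowered:
        return False, None
    # Assumed labels: a slot is 'hybrid' if the text says so, else 'dedicated'.
    card_type = "hybrid" if "hybrid" in lowered else "dedicated"
    return True, card_type
```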

Regex-Based Extraction:

Camera megapixels: 50 MP + 8 MP + 2 MP → rear_camera_main_mp: 50
Display resolution: 1080 x 2400 → res_width_px: 1080, res_height_px: 2400
Processor specs: Octa core (2.84 GHz, Dual core, Kryo 680) → core_count: 8, clock_speed_ghz: 2.84
Battery charging: 5000 mAh, 44W Fast Charging → battery_mah: 5000, charging_watt: 44
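
The four examples above can be reproduced with patterns along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the main rear sensor is assumed to be listed first, and the first core-count word and first GHz value in the processor text are taken as the headline figures. The notebook's actual patterns may differ.

```python
import re

CORE_WORDS = {"single": 1, "dual": 2, "quad": 4, "hexa": 6, "octa": 8}


def parse_rear_cameras(text):
    """'50 MP + 8 MP + 2 MP' -> (rear_camera_count, rear_camera_main_mp)."""
    megapixels = [float(mp) for mp in re.findall(r"(\d+(?:\.\d+)?)\s*MP", text, re.I)]
    if not megapixels:
        return 0, None
    return len(megapixels), megapixels[0]  # assumes the main sensor comes first


def parse_resolution(text):
    """'1080 x 2400' -> (res_width_px, res_height_px)."""
    match = re.search(r"(\d{3,4})\s*[x×]\s*(\d{3,4})", text, re.I)
    return (int(match.group(1)), int(match.group(2))) if match else (None, None)


def parse_processor(text):
    """'Octa core (2.84 GHz, ...)' -> (core_count, clock_speed_ghz)."""
    cores = re.search(r"(single|dual|quad|hexa|octa)\s*core", text, re.I)
    clock = re.search(r"(\d+(?:\.\d+)?)\s*GHz", text, re.I)
    return (
        CORE_WORDS[cores.group(1).lower()] if cores else None,
        float(clock.group(1)) if clock else None,
    )


def parse_battery(text):
    """'5000 mAh, 44W Fast Charging' -> battery_mah, charging_watt, fast_charging."""
    capacity = re.search(r"(\d+)\s*mAh", text, re.I)
    watts = re.search(r"(\d+(?:\.\d+)?)\s*W\b", text, re.I)
    return {
        "battery_mah": int(capacity.group(1)) if capacity else None,
        "charging_watt": float(watts.group(1)) if watts else None,
        "fast_charging": "fast charging" in text.lower(),
    }
```

Every parser returns `None` for a field it cannot find, so missing values stay explicit instead of being silently replaced with zero.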

4️⃣ Exploratory Data Analysis → `03_eda_preview.ipynb`

Interactive visualizations using Plotly to understand:

Price distribution across brands
Processor performance vs price
Camera configuration trends
RAM and storage analysis
Feature correlation analysis
Brand positioning in market segments

Visualization Library:

plotly - Interactive charts
seaborn & matplotlib - Statistical plots

Output:
Comprehensive insights ready for portfolio presentation or further ML modeling
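
Before charting, a quick numeric summary is often useful. The sketch below computes the median price per brand with the standard library only, as a dependency-free stand-in for the Plotly charts in the notebook. The three sample rows are made up for illustration; only the column names `brand`, `model`, and `price_inr` come from the dataset schema.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Three made-up rows standing in for data/processed/smartphones_cleaned.csv.
SAMPLE_CSV = """brand,model,price_inr
Samsung,Example A,20000
Samsung,Example B,30000
Oppo,Example C,15000
"""


def median_price_by_brand(rows):
    """Group price_inr by brand and return {brand: median price}."""
    prices = defaultdict(list)
    for row in rows:
        if row["price_inr"]:  # skip rows with a missing price
            prices[row["brand"]].append(float(row["price_inr"]))
    return {brand: median(values) for brand, values in prices.items()}


summary = median_price_by_brand(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary)
```

To run it on the real file, pass `csv.DictReader(open("data/processed/smartphones_cleaned.csv", newline=""))` instead of the in-memory sample.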

📊 Final Dataset Highlights

Dataset Statistics

758 smartphone models (feature phones excluded)
29 engineered features (from 11 raw columns)
Zero duplicate entries
Systematic missing value handling

Feature Categories

Category	Features
Basic Info	`brand`, `model`, `price_inr`, `rating_score`
Processor	`processor_brand`, `processor_name`, `core_count`, `clock_speed_ghz`
Memory & Storage	`ram_gb`, `storage_gb`
Display	`display_inches`, `res_width_px`, `res_height_px`, `refresh_rate_hz`
Battery	`battery_mah`, `charging_watt`, `fast_charging`
Camera	`rear_camera_count`, `front_camera_count`, `rear_camera_main_mp`, `front_camera_main_mp`
Connectivity	`has_5g`, `has_vo5g`, `has_volte`, `has_nfc`, `has_ir_blaster`
Software	`os_name`
Memory Card	`memory_card_supported`, `memory_card_type`

Data Quality Improvements

✅ Removed 249 feature phones and invalid entries
✅ Standardized brand names (35+ brands normalized)
✅ Extracted numeric values from text descriptions
✅ Created boolean flags for categorical features
✅ Structured nested specifications into flat schema
✅ Validated processor brands (Snapdragon, Dimensity, Helio, Unisoc, etc.)
✅ Parsed complex camera configurations

🚀 Potential Use Cases

Machine Learning

Price prediction using regression models
Feature importance analysis for pricing strategy
Brand positioning clustering
Recommendation systems

Data Analysis

Market trend analysis across brands
Hardware spec evolution over time (once launch dates are added to the dataset)
Value-for-money analysis
Camera and battery capacity trends

Portfolio Projects

EDA dashboards (Plotly, Streamlit, Dash)
Interactive price prediction apps
Teaching real-world data cleaning workflows
Practice dataset for beginners

🔗 Kaggle Dataset

The cleaned dataset is published on Kaggle:

👉 Smartphones Cleaned Dataset (India, 2025)

🛠️ Tech Stack & Dependencies

Core Libraries

pandas          # Data manipulation
numpy           # Numerical operations
beautifulsoup4  # HTML parsing
requests        # HTTP requests
selenium        # Web scraping automation
matplotlib      # Plotting
seaborn         # Statistical visualizations
plotly          # Interactive charts

Installation

# Clone the repository
git clone https://github.com/abhinavflac/smartphone-specs-india.git
cd smartphone-specs-india

# Install dependencies
pip install -r requirements.txt

# Run the notebooks
jupyter notebook notebooks/

Note: Selenium requires ChromeDriver. Update the path in scripts/scrape.py if needed.

🧠 What This Project Demonstrates

Technical Skills

✅ Web Scraping - Selenium automation with infinite scroll handling
✅ HTML Parsing - BeautifulSoup with custom selectors
✅ Data Cleaning - 200+ lines of cleaning logic
✅ Regex Mastery - Complex pattern extraction from text
✅ Feature Engineering - Converting text to structured numeric data
✅ Data Validation - Business rule implementation for quality assurance
✅ ETL Pipeline - End-to-end data workflow

Data Science Best Practices

✅ Reproducible pipeline with clear documentation
✅ Raw data preservation (never modify source data)
✅ Systematic missing value handling
✅ Outlier detection and removal with justification
✅ Feature engineering guided by domain knowledge
✅ Data quality validation at each stage

This repository reflects practical data science work, where most of the effort (often cited as roughly 80%) goes into understanding and preparing the data correctly rather than into model building.

📜 License

This project is released under the CC0 (Public Domain) license.
You are free to use, modify, and distribute the data and code without restriction.

🤝 Contributions & Feedback

Suggestions, issues, or improvements are always welcome!
If you use this dataset in a project, feel free to share your work — I'd love to see it.

Possible Enhancements

Add more brands (international markets)
Include GPU specifications
Add camera sensor details
Expand connectivity features (WiFi versions, Bluetooth)
Include launch date for trend analysis

📧 Contact

For questions or collaboration:

Kaggle: githubmasterin
GitHub: Check repository issues and discussions

⭐ If this project helped you, consider giving it a star!
