Smartphones Cleaned Dataset (India, 2025)

📱 Project Overview

This repository contains the complete workflow behind building a clean, structured, and analysis-ready smartphone dataset for the Indian market.
The project starts from raw scraped HTML and unprocessed CSV files and ends with a fully curated dataset of 758 smartphones, suitable for data analysis and machine learning.

The goal of this project is to demonstrate real-world data engineering and data cleaning skills, not just model building.

🎯 What This Project Solves

Public smartphone data available online is often:

  • Inconsistent in format and naming conventions
  • Text-heavy and noisy with unstructured specifications
  • Contaminated with feature phones and irrelevant entries
  • Hard to use directly for analysis or machine learning

This project transforms that raw data into a reliable, machine-readable dataset by applying systematic cleaning, validation, and feature engineering techniques.

📂 Repository Structure

smartphones-dataset/
├── LICENSE
├── readme.md
├── requirements.txt
│
├── data/
│   ├── raw/
│   │   └── smartphones.csv          # 1007 extracted phones
│   └── processed/
│       └── smartphones_cleaned.csv  # 758 cleaned phones
│
├── images/                          # Visualizations & assets
│
├── notebooks/
│   ├── 01_extract_phones.ipynb      # HTML → CSV extraction
│   ├── 02_cleaning.ipynb            # Data cleaning & feature engineering
│   └── 03_eda_preview.ipynb         # Exploratory data analysis
│
├── scraped/
│   └── smartprix_phones.html        # Raw scraped HTML
│
└── scripts/
    └── scrape.py                    # Selenium-based scraper

🧩 Data Pipeline Explained

1️⃣ Data Extraction → 01_extract_phones.ipynb

  • Scraped raw smartphone listings from Smartprix.com (Indian marketplace)
  • Used Selenium with Chrome WebDriver to handle dynamic page loading
  • Automated infinite scrolling to load all phone listings
  • Parsed the saved HTML with BeautifulSoup to extract specifications
  • Extracted 1,007 phone entries with 11 raw features

Technology Stack:

  • selenium - Dynamic web scraping
  • beautifulsoup4 - HTML parsing
  • Custom CSS selectors for robust extraction

Files:
scripts/scrape.py → scraped/smartprix_phones.html → notebooks/01_extract_phones.ipynb → data/raw/smartphones.csv
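
The infinite-scroll step can be sketched as a loop that keeps scrolling until the page height stops growing. This is a minimal illustration, not the actual contents of scripts/scrape.py: the function name scroll_to_bottom and its parameters are hypothetical. It only relies on execute_script, which Selenium WebDriver objects provide, so any object with that method can be passed in.

```python
import time


def scroll_to_bottom(driver, pause_s=2.0, max_rounds=200):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    `execute_script` works). `pause_s` and `max_rounds` are illustrative defaults.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give lazily loaded listings time to render
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the end of the list was reached
        last_height = new_height
    return rounds
```

With a real browser session, this would typically be called after driver.get(...) and before saving driver.page_source to scraped/smartprix_phones.html. The max_rounds cap is a safety net in case the page never stops growing.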

2️⃣ Data Cleaning → 02_cleaning.ipynb

This is the most intensive phase with 200+ lines of cleaning logic.

Key Operations:

  • Removed 249 feature phones (low specs, old OS, missing processors)
  • Standardized brand names (OPPO → Oppo, SAMSUNG → Samsung)
  • Removed currency symbols and formatting from prices
  • Handled missing values systematically
  • Converted text-based specs into numeric values
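
Two of these operations, price cleanup and brand standardization, might look roughly like the sketch below. The helper names and the BRAND_OVERRIDES table are assumptions made for illustration; the notebook's actual logic may differ. The functions are plain Python so they can be mapped over a pandas column or used on their own.

```python
import re

# Hypothetical overrides for brands whose preferred spelling is not title case.
BRAND_OVERRIDES = {"oneplus": "OnePlus", "iqoo": "iQOO"}


def clean_price(raw):
    """Turn a price string such as '₹12,999' into an int, or None if it has no digits.

    Assumes whole-rupee prices; a decimal part would need extra handling.
    """
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else None


def standardize_brand(raw):
    """Normalize casing and whitespace, e.g. 'OPPO' -> 'Oppo'."""
    key = raw.strip().lower()
    return BRAND_OVERRIDES.get(key, key.title())


print(clean_price("₹12,999"))        # 12999
print(standardize_brand("SAMSUNG"))  # Samsung
```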

Validation Rules Applied:

  • Minimum battery: 3000 mAh
  • Minimum display: > 2.8 inches
  • Minimum RAM: > 32 MB
  • Valid processor required (Snapdragon, Dimensity, Helio, etc.)
  • Modern OS only (Android 4.0+, iOS)
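
As a rough sketch, the rules above can be combined into a single predicate. The dictionary keys (battery_mah, display_inches, ram_mb, processor, os) and the raw OS format ("Android v13") are assumptions, and the keyword list extends the four processor names mentioned in this README with two assumed extras.

```python
import re

# Snapdragon, Dimensity, Helio and Unisoc appear in this README;
# "exynos" and "bionic" are assumed additions covered by its "etc.".
VALID_PROCESSOR_KEYWORDS = (
    "snapdragon", "dimensity", "helio", "unisoc", "exynos", "bionic",
)


def has_modern_os(os_text):
    """True for iOS or Android 4.0+, given text like 'Android v13' (assumed format)."""
    text = (os_text or "").lower()
    if re.search(r"\bios\b", text):
        return True
    match = re.search(r"android\s*v?(\d+(?:\.\d+)?)", text)
    return bool(match) and float(match.group(1)) >= 4.0


def is_smartphone(phone):
    """Apply the validation rules to one record; missing values fail the check."""
    processor = (phone.get("processor") or "").lower()
    return (
        (phone.get("battery_mah") or 0) >= 3000
        and (phone.get("display_inches") or 0.0) > 2.8
        and (phone.get("ram_mb") or 0) > 32
        and any(keyword in processor for keyword in VALID_PROCESSOR_KEYWORDS)
        and has_modern_os(phone.get("os"))
    )
```

Treating missing values as failures is a design choice in this sketch; a stricter pipeline might instead route such rows to manual review.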

Files:
data/raw/smartphones.csv (1007 rows) → notebooks/02_cleaning.ipynb → data/processed/smartphones_cleaned.csv (758 rows)

3️⃣ Feature Engineering → 02_cleaning.ipynb

Transformed messy text columns into 29 well-defined features:

From Text to Structured Data:

  • Processor: Split into processor_brand, processor_name, core_count, clock_speed_ghz
  • Camera: Extracted rear_camera_count, front_camera_count, rear_camera_main_mp, front_camera_main_mp
  • Display: Split into display_inches, res_width_px, res_height_px, refresh_rate_hz
  • Connectivity: Boolean flags: has_5g, has_vo5g, has_volte, has_nfc, has_ir_blaster
  • Charging: Extracted battery_mah, charging_watt, fast_charging (boolean)
  • Storage: Created ram_gb, storage_gb
  • Memory Card: Parsed memory_card_supported (boolean) and memory_card_type
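
The boolean connectivity flags and the memory card fields could be derived along these lines. The raw text formats ("Dual Sim, 3G, 4G, 5G, VoLTE, ..." and "Memory Card (Hybrid), upto 1 TB") and the "hybrid"/"dedicated" type values are assumptions for illustration, not confirmed details of the raw CSV.

```python
import re

# Word boundaries keep '5G' from matching inside 'Vo5G'.
FLAG_PATTERNS = {
    "has_5g": r"\b5G\b",
    "has_vo5g": r"\bVo5G\b",
    "has_volte": r"\bVoLTE\b",
    "has_nfc": r"\bNFC\b",
    "has_ir_blaster": r"\bIR Blaster\b",
}


def connectivity_flags(text):
    """Map a free-text connectivity string to the five boolean flags."""
    return {
        name: bool(re.search(pattern, text or "", re.IGNORECASE))
        for name, pattern in FLAG_PATTERNS.items()
    }


def parse_memory_card(text):
    """Return memory_card_supported and memory_card_type for an assumed raw string."""
    lowered = (text or "").lower()
    if "memory card" not in lowered or "not supported" in lowered:
        return {"memory_card_supported": False, "memory_card_type": None}
    card_type = "hybrid" if "hybrid" in lowered else "dedicated"
    return {"memory_card_supported": True, "memory_card_type": card_type}
```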

Regex-Based Extraction:

  • Camera megapixels: 50 MP + 8 MP + 2 MP → rear_camera_main_mp: 50
  • Display resolution: 1080 x 2400 → res_width_px: 1080, res_height_px: 2400
  • Processor specs: Octa core (2.84 GHz, Dual core, Kryo 680) → parsed into components
  • Battery charging: 5000 mAh, 44W Fast Charging → extracted both values
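
The four extractions listed above can be approximated with short regular expressions. The function names are hypothetical, and two choices are assumptions: the first megapixel value is treated as the main sensor, and the first core-count word in the processor string is treated as the total core count.

```python
import re

CORE_WORDS = {"dual": 2, "quad": 4, "hexa": 6, "octa": 8}


def parse_camera(text):
    """'50 MP + 8 MP + 2 MP' -> lens count and the first (assumed main) sensor."""
    mps = [float(mp) for mp in re.findall(r"(\d+(?:\.\d+)?)\s*MP", text, re.IGNORECASE)]
    return {"count": len(mps), "main_mp": mps[0] if mps else None}


def parse_resolution(text):
    """'1080 x 2400' -> (width_px, height_px), or (None, None) if absent."""
    match = re.search(r"(\d+)\s*[x×]\s*(\d+)", text, re.IGNORECASE)
    return (int(match.group(1)), int(match.group(2))) if match else (None, None)


def parse_battery(text):
    """'5000 mAh, 44W Fast Charging' -> capacity, wattage and a fast-charging flag."""
    capacity = re.search(r"(\d+)\s*mAh", text, re.IGNORECASE)
    watt = re.search(r"(\d+(?:\.\d+)?)\s*W\b", text, re.IGNORECASE)
    return {
        "battery_mah": int(capacity.group(1)) if capacity else None,
        "charging_watt": float(watt.group(1)) if watt else None,
        "fast_charging": "fast charging" in text.lower(),
    }


def parse_processor(text):
    """'Octa core (2.84 GHz, ...)' -> core count and the first clock speed found."""
    cores = re.search(r"\b(dual|quad|hexa|octa)\s*core", text, re.IGNORECASE)
    clock = re.search(r"(\d+(?:\.\d+)?)\s*GHz", text, re.IGNORECASE)
    return {
        "core_count": CORE_WORDS[cores.group(1).lower()] if cores else None,
        "clock_speed_ghz": float(clock.group(1)) if clock else None,
    }


print(parse_resolution("1080 x 2400"))  # (1080, 2400)
```

Each parser returns None for fields it cannot find, which keeps missing specifications explicit instead of silently defaulting them to zero.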

4️⃣ Exploratory Data Analysis → 03_eda_preview.ipynb

Interactive visualizations using Plotly to understand:

  • Price distribution across brands
  • Processor performance vs price
  • Camera configuration trends
  • RAM and storage analysis
  • Feature correlation analysis
  • Brand positioning in market segments

Visualization Libraries:

  • plotly - Interactive charts
  • seaborn & matplotlib - Statistical plots

Output:
Comprehensive insights ready for portfolio presentation or further ML modeling
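
As a small illustration of the numbers behind a "price distribution across brands" chart, the sketch below computes the median price per brand. It deliberately uses only the standard library rather than Plotly, so it is a stand-in for the notebook's charts and not a copy of them. The brand and price_inr keys match the column names listed in this README; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_price_by_brand(rows):
    """Group rows by brand and return the median price_inr for each brand.

    Rows with a missing price are skipped rather than counted as zero.
    """
    prices = defaultdict(list)
    for row in rows:
        if row.get("price_inr") is not None:
            prices[row["brand"]].append(row["price_inr"])
    return {brand: median(values) for brand, values in sorted(prices.items())}
```

The resulting dictionary can be passed to any plotting library as the data for a bar chart.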

📊 Final Dataset Highlights

Dataset Statistics

  • 758 smartphone models (smartphones only, feature phones excluded)
  • 29 engineered features (from 11 raw columns)
  • Zero duplicate entries
  • Systematic missing value handling
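
These figures can be spot-checked with a few lines of standard-library Python. The sketch assumes that a duplicate means two rows sharing the same brand and model, which is one reasonable definition but may not be the one used in the notebook. The function name dataset_summary is hypothetical.

```python
import csv
from collections import Counter


def dataset_summary(csv_path, key_columns=("brand", "model")):
    """Count rows, columns and duplicate keys in a CSV file.

    `key_columns` defines what counts as a duplicate (an assumption here).
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames or []
    keys = Counter(tuple(row[column] for column in key_columns) for row in rows)
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {"rows": len(rows), "columns": len(columns), "duplicates": duplicates}
```

Run against data/processed/smartphones_cleaned.csv, the summary should report 758 rows, 29 columns and 0 duplicates if the statistics above hold.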

Feature Categories

| Category | Features |
|---|---|
| Basic Info | brand, model, price_inr, rating_score |
| Processor | processor_brand, processor_name, core_count, clock_speed_ghz |
| Memory | ram_gb, storage_gb |
| Display | display_inches, res_width_px, res_height_px, refresh_rate_hz |
| Battery | battery_mah, charging_watt, fast_charging |
| Camera | rear_camera_count, front_camera_count, rear_camera_main_mp, front_camera_main_mp |
| Connectivity | has_5g, has_vo5g, has_volte, has_nfc, has_ir_blaster |
| Software | os_name |
| Storage | memory_card_supported, memory_card_type |

Data Quality Improvements

✅ Removed 249 feature phones and invalid entries
✅ Standardized brand names (35+ brands normalized)
✅ Extracted numeric values from text descriptions
✅ Created boolean flags for categorical features
✅ Structured nested specifications into flat schema
✅ Validated processor brands (Snapdragon, Dimensity, Helio, Unisoc, etc.)
✅ Parsed complex camera configurations

🚀 Potential Use Cases

Machine Learning

  • Price prediction using regression models
  • Feature importance analysis for pricing strategy
  • Brand positioning clustering
  • Recommendation systems

Data Analysis

  • Market trend analysis across brands
  • Hardware spec evolution over time (once launch dates are added; see Possible Enhancements)
  • Value-for-money analysis
  • Camera and battery capacity trends

Portfolio Projects

  • EDA dashboards (Plotly, Streamlit, Dash)
  • Interactive price prediction apps
  • Teaching real-world data cleaning workflows
  • Practice dataset for beginners

🔗 Kaggle Dataset

The cleaned dataset is published on Kaggle:

👉 Smartphones Cleaned Dataset (India, 2025)

🛠️ Tech Stack & Dependencies

Core Libraries

pandas          # Data manipulation
numpy           # Numerical operations
beautifulsoup4  # HTML parsing
requests        # HTTP requests
selenium        # Web scraping automation
matplotlib      # Plotting
seaborn         # Statistical visualizations

Installation

# Clone the repository
git clone https://github.com/abhinavflac/smartphone-specs-india.git
cd smartphone-specs-india

# Install dependencies
pip install -r requirements.txt

# Run the notebooks
jupyter notebook notebooks/

Note: Selenium requires ChromeDriver. Update the path in scripts/scrape.py if needed.

🧠 What This Project Demonstrates

Technical Skills

✅ Web Scraping - Selenium automation with infinite scroll handling
✅ HTML Parsing - BeautifulSoup with custom selectors
✅ Data Cleaning - 200+ lines of cleaning logic
✅ Regex Mastery - Complex pattern extraction from text
✅ Feature Engineering - Converting text to structured numeric data
✅ Data Validation - Business rule implementation for quality assurance
✅ ETL Pipeline - End-to-end data workflow

Data Science Best Practices

✅ Reproducible pipeline with clear documentation
✅ Raw data preservation (never modify source data)
✅ Systematic missing value handling
✅ Outlier detection and removal with justification
✅ Feature engineering guided by domain knowledge
✅ Data quality validation at each stage

This repository reflects practical data science work, where most of the effort goes into understanding and preparing the data correctly rather than into model building.

📜 License

This project is released under the CC0 (Public Domain) license.
You are free to use, modify, and distribute the data and code without restriction.

🤝 Contributions & Feedback

Suggestions, issues, or improvements are always welcome!
If you use this dataset in a project, feel free to share your work — I'd love to see it.

Possible Enhancements

  • Add more brands (international markets)
  • Include GPU specifications
  • Add camera sensor details
  • Expand connectivity features (WiFi versions, Bluetooth)
  • Include launch date for trend analysis

📧 Contact

For questions or collaboration:

⭐ If this project helped you, consider giving it a star!
