Project Overview
This repository contains the complete workflow behind building a clean, structured, and analysis-ready smartphone dataset for the Indian market.
The project starts from raw scraped HTML and unprocessed CSV files and ends with a fully curated dataset of 758 smartphones, suitable for data analysis and machine learning.
The goal of this project is to demonstrate real-world data engineering and data cleaning skills, not just model building.
What This Project Solves
Public smartphone data available online is often:
- Inconsistent in format and naming conventions
- Text-heavy and noisy with unstructured specifications
- Contaminated with feature phones and irrelevant entries
- Hard to use directly for analysis or machine learning
This project transforms that raw data into a reliable, machine-readable dataset by applying systematic cleaning, validation, and feature engineering techniques.
Repository Structure
smartphones-dataset/
├── LICENSE
├── readme.md
├── requirements.txt
│
├── data/
│   ├── raw/
│   │   └── smartphones.csv            # 1007 extracted phones
│   └── processed/
│       └── smartphones_cleaned.csv    # 758 cleaned phones
│
├── images/                            # Visualizations & assets
│
├── notebooks/
│   ├── 01_extract_phones.ipynb        # HTML → CSV extraction
│   ├── 02_cleaning.ipynb              # Data cleaning & feature engineering
│   └── 03_eda_preview.ipynb           # Exploratory data analysis
│
├── scraped/
│   └── smartprix_phones.html          # Raw scraped HTML
│
└── scripts/
    └── scrape.py                      # Selenium-based scraper

Data Pipeline Explained
1. Data Extraction → 01_extract_phones.ipynb
- Scraped raw smartphone listings from Smartprix.com (Indian marketplace)
- Used Selenium with Chrome WebDriver to handle dynamic page loading
- Automated infinite scrolling to load all phone listings
- Parsed the HTML with BeautifulSoup to extract specifications
- Extracted 1,007 phone entries with 11 raw features
Technology Stack:
- selenium: Dynamic web scraping
- beautifulsoup4: HTML parsing
- Custom CSS selectors for robust extraction
Files: scripts/scrape.py → scraped/smartprix_phones.html → notebooks/01_extract_phones.ipynb → data/raw/smartphones.csv
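The infinite-scroll step can be sketched roughly as follows. This is not the code in scripts/scrape.py; it is a minimal illustration that assumes new listings load when the page is scrolled to the bottom and that the page height stops growing once everything is loaded. The function name `scroll_until_stable` and its parameters are hypothetical.

```python
import time


def scroll_until_stable(driver, pause_s=2.0, max_rounds=200):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give lazy-loaded listings time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, so assume the list is complete
        last_height = new_height
        loaded += 1
    return loaded
```

With a real browser, the function would be called after `driver.get(...)` on a `selenium.webdriver.Chrome()` instance, and `driver.page_source` could then be written to scraped/smartprix_phones.html. The `max_rounds` cap is a safety net so a page that keeps growing cannot loop forever.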
2. Data Cleaning → 02_cleaning.ipynb
This is the most intensive phase with 200+ lines of cleaning logic.
Key Operations:
- Removed 249 feature phones (low specs, old OS, missing processors)
- Standardized brand names (OPPO → Oppo, SAMSUNG → Samsung)
- Removed currency symbols and formatting from prices
- Handled missing values systematically
- Converted text-based specs into numeric values
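As an illustration of the price and brand steps above, here is a small sketch in plain Python. It is not taken from the notebook: the helper names and the `BRAND_OVERRIDES` table are hypothetical, and the price helper assumes prices are whole rupees with no decimal part.

```python
import re

# Hypothetical overrides for brands whose preferred casing is not title case.
BRAND_OVERRIDES = {"iqoo": "iQOO", "oneplus": "OnePlus"}


def clean_price(raw):
    """Turn a price string such as '₹54,999' into an int, or None if no digits.

    Assumes whole-rupee prices: every non-digit character is dropped.
    """
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else None


def normalize_brand(raw):
    """Map 'OPPO' to 'Oppo' and 'SAMSUNG' to 'Samsung', honoring overrides."""
    key = raw.strip().lower()
    return BRAND_OVERRIDES.get(key, key.title())
```

In a pandas workflow, helpers like these could be applied with `Series.map`, for example `df["brand"].map(normalize_brand)`.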
Validation Rules Applied:
- Minimum battery: 3000 mAh
- Minimum display: > 2.8 inches
- Minimum RAM: > 32 MB
- Valid processor required (Snapdragon, Dimensity, Helio, etc.)
- Modern OS only (Android 4.0+, iOS)
Files: data/raw/smartphones.csv (1007 rows) → notebooks/02_cleaning.ipynb → data/processed/smartphones_cleaned.csv (758 rows)
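The validation rules listed above could be expressed as a single row filter along these lines. This is a sketch under stated assumptions, not the notebook's logic: the dictionary keys (`ram_mb`, `processor`, `os_version`, and so on) are hypothetical names for the intermediate columns, the processor list contains only the families this README names, and rows with missing values are treated as failing the rule.

```python
# Processor families named in this README; the real list is likely longer.
KNOWN_PROCESSORS = ("snapdragon", "dimensity", "helio", "unisoc")


def is_smartphone(row):
    """Return True when a row passes every validation rule.

    `row` is a dict with hypothetical keys: battery_mah, display_inches,
    ram_mb, processor, os_name, os_version. Missing values fail the rule.
    """
    processor = (row.get("processor") or "").lower()
    os_name = (row.get("os_name") or "").lower()
    if os_name == "android":
        modern_os = (row.get("os_version") or 0) >= 4.0  # Android 4.0+
    else:
        modern_os = os_name == "ios"
    return (
        (row.get("battery_mah") or 0) >= 3000       # minimum battery
        and (row.get("display_inches") or 0) > 2.8  # minimum display
        and (row.get("ram_mb") or 0) > 32           # minimum RAM
        and any(name in processor for name in KNOWN_PROCESSORS)
        and modern_os
    )
```

Filtering is then a list comprehension such as `[r for r in rows if is_smartphone(r)]`. Keeping each rule on its own line makes it easy to log which rule rejected a given row if that becomes necessary.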
3. Feature Engineering → 02_cleaning.ipynb
Transformed messy text columns into 29 well-defined features:
From Text to Structured Data:
- Processor: Split into `processor_brand`, `processor_name`, `core_count`, `clock_speed_ghz`
- Camera: Extracted `rear_camera_count`, `front_camera_count`, `rear_camera_main_mp`, `front_camera_main_mp`
- Display: Split into `display_inches`, `res_width_px`, `res_height_px`, `refresh_rate_hz`
- Connectivity: Boolean flags: `has_5g`, `has_vo5g`, `has_volte`, `has_nfc`, `has_ir_blaster`
- Charging: Extracted `battery_mah`, `charging_watt`, `fast_charging` (boolean)
- Storage: Created `ram_gb`, `storage_gb`
- Memory Card: Parsed `memory_card_supported` (boolean) and `memory_card_type`
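To show how the connectivity flags might be derived, here is a minimal sketch. It assumes the raw connectivity column is a comma-separated string such as "Dual Sim, 3G, 4G, 5G, VoLTE, Wi-Fi, NFC"; that format and the function name are assumptions, not something this README confirms.

```python
def connectivity_flags(raw):
    """Derive boolean connectivity flags from a comma-separated spec string."""
    # Compare whole tokens so that '5g' does not match inside 'vo5g'.
    tokens = {part.strip().lower() for part in (raw or "").split(",")}
    return {
        "has_5g": "5g" in tokens,
        "has_vo5g": "vo5g" in tokens,
        "has_volte": "volte" in tokens,
        "has_nfc": "nfc" in tokens,
        "has_ir_blaster": "ir blaster" in tokens,
    }
```

Matching on whole tokens rather than substrings is the main design choice here: a plain `"5g" in raw.lower()` check would wrongly set `has_5g` for a phone that only lists Vo5G.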
Regex-Based Extraction:
- Camera megapixels: `50 MP + 8 MP + 2 MP` → `rear_camera_main_mp: 50`
- Display resolution: `1080 x 2400` → `res_width_px: 1080`, `res_height_px: 2400`
- Processor specs: `Octa core (2.84 GHz, Dual core, Kryo 680)` → parsed into components
- Battery charging: `5000 mAh, 44W Fast Charging` → extracted both values
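The four extractions above can be reproduced with short regular expressions. The patterns below are a sketch written against the example strings in this README, not the notebook's actual patterns; the function names are hypothetical, and the camera helper assumes the first sensor listed is the main one.

```python
import re

CORE_WORDS = {"dual": 2, "quad": 4, "hexa": 6, "octa": 8}


def parse_rear_camera(raw):
    """'50 MP + 8 MP + 2 MP' -> sensor count and main (first listed) megapixels."""
    mps = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*MP", raw, flags=re.I)]
    return {"rear_camera_count": len(mps),
            "rear_camera_main_mp": mps[0] if mps else None}


def parse_resolution(raw):
    """'1080 x 2400' -> width and height in pixels."""
    m = re.search(r"(\d+)\s*[x×]\s*(\d+)", raw, flags=re.I)
    if not m:
        return {"res_width_px": None, "res_height_px": None}
    return {"res_width_px": int(m.group(1)), "res_height_px": int(m.group(2))}


def parse_processor(raw):
    """'Octa core (2.84 GHz, ...)' -> core count and clock speed (first match each)."""
    cores = re.search(r"(dual|quad|hexa|octa)\s*core", raw, flags=re.I)
    clock = re.search(r"(\d+(?:\.\d+)?)\s*GHz", raw, flags=re.I)
    return {"core_count": CORE_WORDS[cores.group(1).lower()] if cores else None,
            "clock_speed_ghz": float(clock.group(1)) if clock else None}


def parse_battery(raw):
    """'5000 mAh, 44W Fast Charging' -> capacity, wattage, fast-charging flag."""
    cap = re.search(r"(\d+)\s*mAh", raw, flags=re.I)
    watt = re.search(r"(\d+(?:\.\d+)?)\s*W\b", raw, flags=re.I)
    return {"battery_mah": int(cap.group(1)) if cap else None,
            "charging_watt": float(watt.group(1)) if watt else None,
            "fast_charging": "fast charging" in raw.lower()}
```

Each helper returns `None` for a value it cannot find instead of raising, so missing specifications can be handled later in one place.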
4. Exploratory Data Analysis → 03_eda_preview.ipynb
Interactive visualizations using Plotly to understand:
- Price distribution across brands
- Processor performance vs price
- Camera configuration trends
- RAM and storage analysis
- Feature correlation analysis
- Brand positioning in market segments
Visualization Library:
- plotly: Interactive charts
- seaborn & matplotlib: Statistical plots
Output:
Comprehensive insights ready for portfolio presentation or further ML modeling
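As a rough idea of the aggregation behind a "price across brands" view, the sketch below computes the median price per brand using only the standard library. The notebook itself relies on Plotly and pandas; this function and its name are hypothetical, and it assumes rows are dicts carrying the `brand` and `price_inr` columns.

```python
from collections import defaultdict
from statistics import median


def median_price_by_brand(rows):
    """Group rows by brand and return the median price_inr per brand."""
    prices = defaultdict(list)
    for row in rows:
        if row.get("price_inr") is not None:  # skip rows without a price
            prices[row["brand"]].append(row["price_inr"])
    # Sort brands alphabetically so the output order is stable.
    return {brand: median(values) for brand, values in sorted(prices.items())}
```

The resulting dictionary could be fed straight into a bar chart. The median is used here rather than the mean because a few flagship models can pull a brand's average price well above what most of its phones cost.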
Final Dataset Highlights
Dataset Statistics
- 758 smartphone models (smartphones only, feature phones excluded)
- 29 engineered features (from 11 raw columns)
- Zero duplicate entries
- Systematic missing value handling
Feature Categories
| Category | Features |
|---|---|
| Basic Info | brand, model, price_inr, rating_score |
| Processor | processor_brand, processor_name, core_count, clock_speed_ghz |
| Memory | ram_gb, storage_gb |
| Display | display_inches, res_width_px, res_height_px, refresh_rate_hz |
| Battery | battery_mah, charging_watt, fast_charging |
| Camera | rear_camera_count, front_camera_count, rear_camera_main_mp, front_camera_main_mp |
| Connectivity | has_5g, has_vo5g, has_volte, has_nfc, has_ir_blaster |
| Software | os_name |
| Storage | memory_card_supported, memory_card_type |
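The table above adds up to 29 columns, which makes a simple schema check possible. The sketch below assumes the CSV header uses exactly the names in the table and that a duplicate means a repeated (brand, model) pair; both are assumptions, and the helper names are hypothetical.

```python
from collections import Counter

# Column names copied from the feature table in this README (29 in total).
EXPECTED_COLUMNS = [
    "brand", "model", "price_inr", "rating_score",
    "processor_brand", "processor_name", "core_count", "clock_speed_ghz",
    "ram_gb", "storage_gb",
    "display_inches", "res_width_px", "res_height_px", "refresh_rate_hz",
    "battery_mah", "charging_watt", "fast_charging",
    "rear_camera_count", "front_camera_count",
    "rear_camera_main_mp", "front_camera_main_mp",
    "has_5g", "has_vo5g", "has_volte", "has_nfc", "has_ir_blaster",
    "os_name",
    "memory_card_supported", "memory_card_type",
]


def check_schema(header):
    """Return (missing, unexpected) column names for a CSV header."""
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, unexpected


def find_duplicates(rows):
    """Return (brand, model) pairs that appear more than once."""
    counts = Counter((row["brand"], row["model"]) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Possible usage against the processed file:
#   import csv
#   with open("data/processed/smartphones_cleaned.csv", newline="") as f:
#       reader = csv.DictReader(f)
#       print(check_schema(reader.fieldnames), find_duplicates(list(reader)))
```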
Data Quality Improvements
- Removed 249 feature phones and invalid entries
- Standardized brand names (35+ brands normalized)
- Extracted numeric values from text descriptions
- Created boolean flags for categorical features
- Structured nested specifications into flat schema
- Validated processor brands (Snapdragon, Dimensity, Helio, Unisoc, etc.)
- Parsed complex camera configurations
Potential Use Cases
Machine Learning
- Price prediction using regression models
- Feature importance analysis for pricing strategy
- Brand positioning clustering
- Recommendation systems
Data Analysis
- Market trend analysis across brands
- Hardware spec evolution over time
- Value-for-money analysis
- Camera and battery capacity trends
Portfolio Projects
- EDA dashboards (Plotly, Streamlit, Dash)
- Interactive price prediction apps
- Teaching real-world data cleaning workflows
- Practice dataset for beginners
Kaggle Dataset
The cleaned dataset is published on Kaggle:
Smartphones Cleaned Dataset (India, 2025)
Tech Stack & Dependencies
Core Libraries
pandas # Data manipulation
numpy # Numerical operations
beautifulsoup4 # HTML parsing
requests # HTTP requests
selenium # Web scraping automation
matplotlib # Plotting
seaborn # Statistical visualizations
Installation
# Clone the repository
git clone https://github.com/abhinavflac/smartphone-specs-india.git
cd smartphone-specs-india
# Install dependencies
pip install -r requirements.txt
# Run the notebooks
jupyter notebook notebooks/
Note: Selenium requires ChromeDriver. Update the path in scripts/scrape.py if needed.
What This Project Demonstrates
Technical Skills
- Web Scraping: Selenium automation with infinite scroll handling
- HTML Parsing: BeautifulSoup with custom selectors
- Data Cleaning: 200+ lines of cleaning logic
- Regex Mastery: Complex pattern extraction from text
- Feature Engineering: Converting text to structured numeric data
- Data Validation: Business rule implementation for quality assurance
- ETL Pipeline: End-to-end data workflow
Data Science Best Practices
- Reproducible pipeline with clear documentation
- Raw data preservation (never modify source data)
- Systematic missing value handling
- Outlier detection and removal with justification
- Feature engineering guided by domain knowledge
- Data quality validation at each stage
This repository reflects practical data science work, where 80% of effort goes into understanding and preparing the data correctly, not just model building.
License
This project is released under the CC0 (Public Domain) license.
You are free to use, modify, and distribute the data and code without restriction.
Contributions & Feedback
Suggestions, issues, or improvements are always welcome!
If you use this dataset in a project, feel free to share your work. I'd love to see it.
Possible Enhancements
- Add more brands (international markets)
- Include GPU specifications
- Add camera sensor details
- Expand connectivity features (WiFi versions, Bluetooth)
- Include launch date for trend analysis
Contact
For questions or collaboration:
- Kaggle: githubmasterin
- GitHub: Check repository issues and discussions
If this project helped you, consider giving it a star!
