Project Overview
This repository contains the complete workflow behind building a clean, structured, and analysis-ready laptop dataset for the Indian market.
The project starts from raw scraped HTML and unprocessed data and ends with a fully curated dataset of 992 laptops, suitable for data analysis, machine learning, and price prediction.
The goal of this project is to demonstrate real-world data engineering and data cleaning skills, with special emphasis on parsing complex CPU and GPU specifications.
What This Project Solves
Public laptop data available online is often:
- Unstructured with complex processor naming conventions (Intel Core Ultra, AMD Ryzen AI, Apple M-series)
- Missing standardized GPU information (integrated vs dedicated, VRAM)
- Difficult to parse for hybrid-core CPU architectures (P-cores, E-cores, LP E-cores)
- Hard to use directly for analysis or machine learning
This project transforms that raw data into a reliable, machine-readable dataset by applying systematic cleaning, validation, and advanced feature engineering techniques.
Repository Structure
laptops-dataset/
├── LICENSE
├── README.md
├── requirements.txt
│
├── data/
│   ├── raw/
│   │   └── laptops_raw.csv          # Original scraped data
│   └── processed/
│       └── laptops_cleaned.csv      # Cleaned dataset (992 laptops)
│
├── images/                          # Visualizations & assets
│
├── notebooks/
│   ├── 01_data_cleaning.ipynb       # Data cleaning & feature engineering
│   └── 02_eda.ipynb                 # Exploratory data analysis
│
├── scraped/
│   └── smartprix_laptops.html       # Raw scraped HTML
│
└── scripts/
    └── scrape.py                    # Selenium-based scraper
Data Pipeline Explained
1. Data Collection (scripts/scrape.py)
- Raw laptop listings were scraped from Smartprix.com (Indian marketplace)
- Used Selenium with Chrome WebDriver for dynamic page loading
- Infinite scroll automation to load all laptop listings
- Collected 1,000+ laptop entries with raw specifications
Technology Stack:
- selenium: dynamic web scraping
- Infinite scroll handling
- HTML saved for reproducibility
Files: scripts/scrape.py → scraped/smartprix_laptops.html
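The infinite-scroll step can be sketched as follows. This is a minimal illustration, not the contents of scripts/scrape.py: the function names, the pause length, the round limit, and the `url` argument are all assumptions. It uses only Selenium's `execute_script`, `get`, `page_source`, and `quit`.

```python
import time


def scroll_until_stable(driver, pause_s=2.0, max_rounds=200):
    """Scroll to the page bottom until the document height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give lazily loaded listings time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded, so stop
            return round_no
        last_height = new_height
    return max_rounds


def save_listing_html(url, out_path="scraped/smartprix_laptops.html"):
    """Open `url` in Chrome, scroll through it, and save the final HTML."""
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_until_stable(driver)
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(driver.page_source)
    finally:
        driver.quit()
```

Stopping when the page height no longer changes is a common heuristic; a slow connection may need a longer `pause_s` to avoid ending early.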
2. Data Cleaning (notebooks/01_data_cleaning.ipynb)
This is the most intensive phase, with 200+ lines of cleaning logic.
Key Operations:
- Removed currency symbols from prices
- Extracted brand from model names
- Cleaned RAM/Storage values (converted TB to GB)
- Handled missing values systematically
- Standardized OS names
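A few of these operations could look roughly like the helpers below. This is a sketch under assumptions, not the notebook's code: the function names are hypothetical, the brand is assumed to be the first word of the model name, and 1 TB is assumed to equal 1024 GB (the notebook may use 1000).

```python
import re

GB_PER_TB = 1024  # assumption: binary convention; swap for 1000 if needed


def parse_price(text):
    """'₹54,990' -> 54990. Returns None when the text holds no number."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else None


def extract_brand(model_name):
    """Take the first word of the model name as the brand (an assumption)."""
    parts = (model_name or "").split()
    return parts[0] if parts else None


def to_gb(text):
    """'1 TB SSD' -> 1024, '512GB' -> 512. Returns None if nothing matches."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(TB|GB)", text or "", flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1))
    if match.group(2).upper() == "TB":
        value *= GB_PER_TB
    return int(value)
```

With pandas, each helper can be applied column-wise, for example `df["price"].map(parse_price)`, where `df` and the column name are placeholders.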
Validation Rules Applied:
- Valid CPU required (Intel, AMD, Apple, Qualcomm)
- Price range validation
- Display size validation
- Warranty data normalization
Files: data/raw/laptops_raw.csv → notebooks/01_data_cleaning.ipynb → data/processed/laptops_cleaned.csv (992 rows)
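The validation rules listed above could be expressed as a per-record check like the one below. The numeric bounds are hypothetical placeholders chosen for illustration; the README does not state the actual thresholds. The function name and the dict-based record are also assumptions.

```python
VALID_CPU_BRANDS = {"Intel", "AMD", "Apple", "Qualcomm"}  # from the rule list

# Hypothetical bounds, for illustration only.
PRICE_RANGE_INR = (5_000, 1_000_000)
DISPLAY_RANGE_INCH = (10.0, 20.0)


def validation_errors(row):
    """Return a list of rule violations for one record (empty list = valid)."""
    errors = []
    if row.get("cpu_brand") not in VALID_CPU_BRANDS:
        errors.append("unknown cpu_brand")
    price = row.get("price")
    if price is None or not PRICE_RANGE_INR[0] <= price <= PRICE_RANGE_INR[1]:
        errors.append("price out of range")
    size = row.get("display_size_inch")
    if size is None or not DISPLAY_RANGE_INCH[0] <= size <= DISPLAY_RANGE_INCH[1]:
        errors.append("display size out of range")
    return errors
```

Returning the list of violations, rather than a bare boolean, makes it easier to see why a scraped row was dropped.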
3. Feature Engineering (notebooks/01_data_cleaning.ipynb)
Transformed messy text columns into 27 well-defined features:
CPU Parsing (Complex):
- Brand: Intel, AMD, Apple, Qualcomm
- Family: Core, Ryzen, M-Series, Celeron
- Series: Core i5, Ryzen 7, Core Ultra 7, M4 Pro
- Model: 13420H, 7530U, M4 Pro
- Suffix: U, H, HX, Pro, Max
- Core Details: cpu_core_count, cpu_thread_count, cpu_p_cores, cpu_e_cores, cpu_lp_e_cores
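To illustrate the kind of pattern matching involved, here is a small sketch that splits a processor name into the fields listed above. It is not the notebook's parser: the regular expressions and the `parse_cpu` name are assumptions, and only a few naming schemes are covered (Intel Core i-series and Core Ultra, AMD Ryzen, Apple M-series). Names such as Celeron or Ryzen AI would need additional patterns.

```python
import re

# Intel / AMD names such as "Intel Core i5 13420H" or "AMD Ryzen 7 7530U".
X86_PATTERN = re.compile(
    r"""
    (?P<brand>Intel|AMD)\s+
    (?P<series>Core\s+Ultra\s+\d|Core\s+i\d|Ryzen\s+\d)\s+
    (?P<model>\d{3,5}(?P<suffix>[A-Z]{0,2}))
    """,
    re.VERBOSE,
)

# Apple names such as "Apple M4" or "Apple M4 Pro".
APPLE_PATTERN = re.compile(r"Apple\s+(?P<chip>M\d)(?:\s+(?P<suffix>Pro|Max))?")


def parse_cpu(text):
    """Return a dict of CPU fields, or None when no known pattern matches."""
    m = APPLE_PATTERN.search(text)
    if m:
        suffix = m.group("suffix")
        name = f"{m.group('chip')} {suffix}" if suffix else m.group("chip")
        return {"cpu_brand": "Apple", "cpu_family": "M-Series",
                "cpu_series": name, "cpu_model": name, "cpu_suffix": suffix}
    m = X86_PATTERN.search(text)
    if m:
        series = " ".join(m.group("series").split())  # collapse extra spaces
        return {"cpu_brand": m.group("brand"), "cpu_family": series.split()[0],
                "cpu_series": series, "cpu_model": m.group("model"),
                "cpu_suffix": m.group("suffix") or None}
    return None
```

For example, `parse_cpu("Intel Core i5 13420H")` yields brand `Intel`, family `Core`, series `Core i5`, model `13420H`, and suffix `H`.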
GPU Parsing:
- Brand: NVIDIA, AMD, Intel, Apple, Qualcomm
- Series: GeForce RTX, Radeon, Iris Xe, Apple GPU
- Model: RTX 4060, Radeon Graphics, Apple 10-Core GPU
- VRAM: Dedicated GPU memory in GB
- Type: Integrated vs Dedicated classification
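A rough sketch of the brand, VRAM, and type extraction is shown below. The classification rule is an assumption made for illustration: a GPU is treated as dedicated when the text states a VRAM size or contains one of a few marker strings. The notebook's actual rule, and the exact format of the raw GPU strings, may differ.

```python
import re

GPU_BRANDS = ("NVIDIA", "AMD", "Intel", "Apple", "Qualcomm")
DEDICATED_MARKERS = ("GeForce", "RTX", "GTX", "Radeon RX")  # assumed markers


def parse_gpu(text):
    """Extract gpu_brand, gpu_vram_gb, and gpu_type from a raw GPU string."""
    vram = re.search(r"(\d+)\s*GB", text, flags=re.IGNORECASE)
    is_dedicated = vram is not None or any(k in text for k in DEDICATED_MARKERS)
    brand = next((b for b in GPU_BRANDS if b.lower() in text.lower()), None)
    return {
        "gpu_brand": brand,
        "gpu_vram_gb": int(vram.group(1)) if vram else None,
        "gpu_type": "Dedicated" if is_dedicated else "Integrated",
    }
```

A string such as `"Intel Iris Xe Graphics"` has no VRAM figure and no marker, so it is classified as integrated.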
Other Features:
- Display: display_width_px, display_height_px, display_size_inch
- Memory: ram_gb, storage_gb
- System: os_name, warranty_years
- Category: device_category (Gaming, Business, Thin & Light, General)
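As an example of turning a text column into numeric features, the display fields could be extracted along these lines. The input format (`"15.6 inches, 1920 x 1080 pixels"`) and the `parse_display` name are assumptions; the raw Smartprix text may be laid out differently.

```python
import re


def parse_display(text):
    """Pull the diagonal size and pixel resolution out of a display string."""
    size = re.search(r"(\d+(?:\.\d+)?)\s*inch(?:es)?", text, flags=re.IGNORECASE)
    res = re.search(r"(\d{3,5})\s*[x×]\s*(\d{3,5})", text, flags=re.IGNORECASE)
    return {
        "display_size_inch": float(size.group(1)) if size else None,
        "display_width_px": int(res.group(1)) if res else None,
        "display_height_px": int(res.group(2)) if res else None,
    }
```

Missing pieces come back as `None` rather than raising, so incomplete listings can be handled later in the missing-value step.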
4. Exploratory Data Analysis (notebooks/02_eda.ipynb)
Interactive visualizations to understand:
- Price distribution across brands
- CPU/GPU market share
- Device category analysis
- RAM and storage trends
- Brand positioning in market segments
Final Dataset Highlights
Dataset Statistics
- 992 laptop models (cleaned and validated)
- 27 engineered features (from raw columns)
- 15+ brands (Lenovo, HP, Dell, Asus, Acer, Apple, etc.)
- Price range: ₹12,490 - ₹605,990
Feature Categories
| Category | Features |
|---|---|
| Basic Info | brand, model, device_category, price, rating |
| CPU Details | cpu_brand, cpu_family, cpu_series, cpu_model, cpu_suffix, cpu_core_count, cpu_thread_count, cpu_p_cores, cpu_e_cores, cpu_lp_e_cores |
| Memory | ram_gb, storage_gb |
| Display | display_width_px, display_height_px, display_size_inch |
| GPU Details | gpu_brand, gpu_series, gpu_model, gpu_vram_gb, gpu_type |
| OS & Warranty | os_name, warranty_years |
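A quick schema check against this table might look like the sketch below. It assumes the CSV header uses exactly the feature names from the table, which the README implies but does not state; `missing_columns` is a hypothetical helper.

```python
import csv

# The 27 feature names, in the order they appear in the table above.
EXPECTED_COLUMNS = [
    "brand", "model", "device_category", "price", "rating",
    "cpu_brand", "cpu_family", "cpu_series", "cpu_model", "cpu_suffix",
    "cpu_core_count", "cpu_thread_count", "cpu_p_cores", "cpu_e_cores",
    "cpu_lp_e_cores",
    "ram_gb", "storage_gb",
    "display_width_px", "display_height_px", "display_size_inch",
    "gpu_brand", "gpu_series", "gpu_model", "gpu_vram_gb", "gpu_type",
    "os_name", "warranty_years",
]


def missing_columns(csv_file):
    """Return the expected columns absent from the CSV header, in table order."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [col for col in EXPECTED_COLUMNS if col not in present]


# Usage (path taken from the repository structure):
# with open("data/processed/laptops_cleaned.csv", newline="", encoding="utf-8") as fh:
#     print(missing_columns(fh))
```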
Market Statistics
- Intel: 71% CPU market share
- AMD: 23% CPU market share
- NVIDIA: 29% of GPUs
- Integrated GPUs: 70% of laptops
- Windows: 94% of operating systems
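Shares like these are simple frequency counts over a column. A minimal sketch, using a hypothetical `share_percent` helper and a toy input rather than the real data:

```python
from collections import Counter


def share_percent(values):
    """Map each distinct value to its share of the total, in whole percent."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if not total:
        return {}
    return {key: round(100 * n / total) for key, n in counts.most_common()}


# Toy input for illustration only, not taken from the dataset.
print(share_percent(["Intel", "Intel", "Intel", "AMD"]))
```

With pandas, `Series.value_counts(normalize=True)` gives the same proportions directly.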
Data Quality Improvements
- Extracted brand from model names
- Parsed complex processor names (Intel Core Ultra, AMD Ryzen AI, Apple M-series)
- Extracted numeric core counts from text (e.g., "20 Cores (8P + 12E)")
- Classified GPU as integrated vs dedicated
- Converted TB storage to GB
- Standardized OS names
- Handled missing warranty values
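The core-count extraction mentioned above can be sketched with a few regular expressions. The patterns and the `parse_core_counts` name are assumptions. The P/E layout follows the "20 Cores (8P + 12E)" example; the notation for LP E-cores (assumed here to contain "LP") is a guess, since the README gives no sample.

```python
import re


def parse_core_counts(text):
    """Extract total, P-, E-, and LP E-core counts from a core description."""

    def grab(pattern):
        match = re.search(pattern, text)
        return int(match.group(1)) if match else None

    return {
        "cpu_core_count": grab(r"(\d+)\s*[Cc]ores?"),
        "cpu_p_cores": grab(r"(\d+)\s*P\b"),
        "cpu_e_cores": grab(r"(\d+)\s*E\b"),
        "cpu_lp_e_cores": grab(r"(\d+)\s*LP"),  # assumed notation
    }
```

For CPUs without a hybrid layout, the P/E fields simply stay `None`.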
Potential Use Cases
Machine Learning
- Price prediction using regression models
- Feature importance analysis for pricing strategy
- Brand positioning clustering
- Value-for-money recommendation systems
Data Analysis
- Market trend analysis across brands
- Hardware spec evolution over time
- CPU/GPU performance vs price analysis
- Gaming laptop vs business laptop comparison
Portfolio Projects
- EDA dashboards (Plotly, Streamlit, Dash)
- Interactive price prediction apps
- Teaching real-world data cleaning workflows
- Practice dataset for beginners
Kaggle Dataset
The cleaned dataset is published on Kaggle.
Tech Stack & Dependencies
Core Libraries
pandas    # Data manipulation
numpy     # Numerical operations
jupyter   # Interactive notebooks
regex     # Pattern matching for CPU/GPU parsing
Installation
# Clone the repository
git clone https://github.com/abhinavflac/laptops-specs-dataset.git
cd laptops-specs-dataset
# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the notebooks
jupyter notebook notebooks/
What This Project Demonstrates
Technical Skills
- Web Scraping: Selenium automation with infinite scroll handling
- Data Cleaning: 200+ lines of cleaning logic
- Regex Mastery: Complex pattern extraction for CPU/GPU specs
- Feature Engineering: Text to structured data transformation
- Data Validation: Business rule implementation for quality assurance
- ETL Pipeline: End-to-end reproducible workflow
Data Science Best Practices
- Reproducible pipeline with clear documentation
- Raw data preservation (never modify source data)
- Systematic missing value handling
- Feature engineering guided by domain knowledge
- Data quality validation at each stage
This repository reflects practical data science work, where most of the effort (around 80%, by a common rule of thumb) goes into understanding and preparing the data correctly rather than into model building.
License
This project is released under the CC0 (Public Domain) license.
You are free to use, modify, and distribute the data and code without restriction.
Contributions & Feedback
Suggestions, issues, or improvements are always welcome!
If you use this dataset in a project, feel free to share your work; I'd love to see it.
Possible Enhancements
- Add more brands (international markets)
- Include battery specifications
- Add weight and dimensions
- Include launch dates for trend analysis
- Expand connectivity features (WiFi, Bluetooth versions)
Contact
For questions or collaboration:
- Kaggle: githubmasterin
- GitHub: @abhinavflac
If this project helped you, consider giving it a star!
