Laptops Cleaned Dataset (India, 2025)

💻 Project Overview

This repository contains the complete workflow behind building a clean, structured, and analysis-ready laptop dataset for the Indian market.
The project starts from raw scraped HTML and unprocessed data and ends with a fully curated dataset of 992 laptops, suitable for data analysis, machine learning, and price prediction.

The goal of this project is to demonstrate real-world data engineering and data cleaning skills, with special emphasis on parsing complex CPU and GPU specifications.

📋 What This Project Solves

Public laptop data available online is often:

Unstructured with complex processor naming conventions (Intel Core Ultra, AMD Ryzen AI, Apple M-series)
Missing standardized GPU information (integrated vs dedicated, VRAM)
Difficult to parse for hybrid-core CPU architectures (P-cores, E-cores, LP E-cores)
Hard to use directly for analysis or machine learning

This project transforms that raw data into a reliable, machine-readable dataset by applying systematic cleaning, validation, and advanced feature engineering techniques.

📂 Repository Structure

laptops-dataset/
├── LICENSE
├── README.md
├── requirements.txt
│
├── data/
│   ├── raw/
│   │   └── laptops_raw.csv          # Original scraped data
│   └── processed/
│       └── laptops_cleaned.csv      # Cleaned dataset (992 laptops)
│
├── images/                           # Visualizations & assets
│
├── notebooks/
│   ├── 01_data_cleaning.ipynb       # Data cleaning & feature engineering
│   └── 02_eda.ipynb                 # Exploratory data analysis
│
├── scraped/
│   └── smartprix_laptops.html       # Raw scraped HTML
│
└── scripts/
    └── scrape.py                    # Selenium-based scraper

🧩 Data Pipeline Explained

1️⃣ Data Collection → `scripts/scrape.py`

Scraped raw laptop listings from Smartprix.com (Indian marketplace)
Used Selenium with Chrome WebDriver to handle dynamically loaded pages
Automated infinite scrolling to load every laptop listing
Collected 1,000+ laptop entries with raw specifications

Technology Stack:

selenium - Dynamic web scraping
Infinite scroll handling
HTML saved for reproducibility

Files:
scripts/scrape.py → scraped/smartprix_laptops.html
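
The infinite-scroll step can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/scrape.py`: the function name `scroll_until_stable`, the pause length, and the round limit are all assumptions. It relies only on Selenium's `execute_script` method, so any driver object exposing that method will work.

```python
import time


def scroll_until_stable(driver, pause_s=2.0, max_rounds=200):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give lazily loaded listings time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: assume the end of the listing
        last_height = new_height
        loaded += 1
    return loaded


# Usage with a real browser (needs selenium and Chrome; the URL is a placeholder):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://example.com/laptops")
#   scroll_until_stable(driver)
#   with open("scraped/smartprix_laptops.html", "w", encoding="utf-8") as f:
#       f.write(driver.page_source)
#   driver.quit()
```

Stopping when the page height no longer changes is a simple heuristic. A slow connection may need a longer pause to avoid stopping early.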

2️⃣ Data Cleaning → `notebooks/01_data_cleaning.ipynb`

This is the most intensive phase, with 200+ lines of cleaning logic.

Key Operations:

Removed currency symbols from prices
Extracted brand from model names
Cleaned RAM/Storage values (converted TB to GB)
Handled missing values systematically
Standardized OS names
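
A few of the operations above can be sketched as small pure functions. The helper names, the raw string formats, and the 1 TB = 1024 GB factor are assumptions made for illustration; the notebook may use different rules.

```python
import re


def parse_price(text):
    """Return the price as an int, or None when no number is found."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else None


def to_gb(text, gb_per_tb=1024):
    """Convert strings like '512 GB SSD' or '1 TB' to a whole number of GB.

    The 1024 factor is an assumption; pass gb_per_tb=1000 for decimal units.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*(TB|GB)", text or "", flags=re.IGNORECASE)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).upper()
    return int(value * gb_per_tb) if unit == "TB" else int(value)


def extract_brand(model_name):
    """Assume the brand is the first word of the model name."""
    parts = (model_name or "").split()
    return parts[0] if parts else None
```

With pandas, each helper could be applied to a column via `Series.map`, for example `df["price"].map(parse_price)`.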

Validation Rules Applied:

Valid CPU required (Intel, AMD, Apple, Qualcomm)
Price range validation
Display size validation
Warranty data normalization
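
The first three rules could be expressed as a row-level check like the one below. This is a sketch under assumptions: the numeric bounds are placeholders (the actual thresholds are not stated here), and `validation_errors` is a hypothetical helper name.

```python
VALID_CPU_BRANDS = {"Intel", "AMD", "Apple", "Qualcomm"}


def validation_errors(row, price_bounds=(5_000, 1_000_000), size_bounds=(10.0, 20.0)):
    """Return the names of the fields that fail validation (empty list = valid).

    The default bounds are illustrative placeholders, not the project's values.
    """
    errors = []
    if row.get("cpu_brand") not in VALID_CPU_BRANDS:
        errors.append("cpu_brand")
    price = row.get("price")
    if price is None or not price_bounds[0] <= price <= price_bounds[1]:
        errors.append("price")
    size = row.get("display_size_inch")
    if size is None or not size_bounds[0] <= size <= size_bounds[1]:
        errors.append("display_size_inch")
    return errors
```

Returning the failing field names rather than a single boolean makes it easier to report why a row was dropped.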

Files:
data/raw/laptops_raw.csv → notebooks/01_data_cleaning.ipynb → data/processed/laptops_cleaned.csv (992 rows)

3️⃣ Feature Engineering → `notebooks/01_data_cleaning.ipynb`

Transformed messy text columns into 27 well-defined features:

CPU Parsing (Complex):

Brand: Intel, AMD, Apple, Qualcomm
Family: Core, Ryzen, M-Series, Celeron
Series: Core i5, Ryzen 7, Core Ultra 7, M4 Pro
Model: 13420H, 7530U, M4 Pro
Suffix: U, H, HX, Pro, Max
Core Details: cpu_core_count, cpu_thread_count, cpu_p_cores, cpu_e_cores, cpu_lp_e_cores
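
As a rough illustration of the name-parsing step, the sketch below splits a processor string into the fields listed above. The regular expressions and the `parse_cpu` helper are assumptions. They cover only the common "Intel Core", "AMD Ryzen", and "Apple M" patterns, and leave out names such as Celeron or Ryzen AI that the real notebook handles.

```python
import re

# Matches e.g. "Intel Core i5 13420H", "Intel Core Ultra 7 155H", "AMD Ryzen 7 7530U"
INTEL_AMD = re.compile(
    r"(?P<brand>Intel|AMD)\s+"
    r"(?P<series>(?P<family>Core|Ryzen)\s+(?:Ultra\s+)?i?\d)"
    r"[\s-]+(?P<model>\d{3,5}(?P<suffix>[A-Z]{0,2}))\b"
)
# Matches e.g. "Apple M3", "Apple M4 Pro"
APPLE = re.compile(r"Apple\s+(?P<model>M\d(?:\s+(?P<suffix>Pro|Max|Ultra))?)")


def parse_cpu(name):
    """Split a CPU name into brand, family, series, model and suffix (or None)."""
    m = APPLE.search(name)
    if m:
        return {
            "cpu_brand": "Apple",
            "cpu_family": "M-Series",
            "cpu_series": m["model"],
            "cpu_model": m["model"],
            "cpu_suffix": m["suffix"] or "",
        }
    m = INTEL_AMD.search(name)
    if m:
        return {
            "cpu_brand": m["brand"],
            "cpu_family": m["family"],
            "cpu_series": " ".join(m["series"].split()),  # collapse extra spaces
            "cpu_model": m["model"],
            "cpu_suffix": m["suffix"],
        }
    return None  # unrecognized pattern: leave for manual review
```

Returning `None` for unmatched names keeps unusual processors visible instead of silently guessing their fields.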

GPU Parsing:

Brand: NVIDIA, AMD, Intel, Apple, Qualcomm
Series: GeForce RTX, Radeon, Iris Xe, Apple GPU
Model: RTX 4060, Radeon Graphics, Apple 10-Core GPU
VRAM: Dedicated GPU memory in GB
Type: Integrated vs Dedicated classification
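
One possible heuristic for the type and VRAM fields is sketched below. The rule "a GPU is dedicated if its name contains an RTX, GTX, or RX model number" is an assumption for illustration, not necessarily the rule used in the notebook, and `parse_gpu` is a hypothetical helper.

```python
import re

DEDICATED = re.compile(r"\b(?:RTX|GTX|RX)\s*\d{3,4}", re.IGNORECASE)
VRAM = re.compile(r"(\d+)\s*GB", re.IGNORECASE)


def parse_gpu(text):
    """Classify a GPU string and extract VRAM in GB for dedicated GPUs."""
    is_dedicated = DEDICATED.search(text) is not None
    vram = VRAM.search(text) if is_dedicated else None
    return {
        "gpu_type": "Dedicated" if is_dedicated else "Integrated",
        "gpu_vram_gb": int(vram.group(1)) if vram else None,
    }
```

For example, `parse_gpu("8 GB NVIDIA GeForce RTX 4060")` is classified as dedicated with 8 GB of VRAM, while `parse_gpu("Intel Iris Xe Graphics")` falls through to integrated.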

Other Features:

Display: display_width_px, display_height_px, display_size_inch
Memory: ram_gb, storage_gb
System: os_name, warranty_years
Category: device_category (Gaming, Business, Thin & Light, General)
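
The display columns could be extracted along these lines. The raw format assumed here ("15.6 inches, 1920 x 1080 pixels") is a guess at what the scraped text looks like, and `parse_display` is a hypothetical helper.

```python
import re


def parse_display(text):
    """Extract diagonal size and pixel resolution from a display description."""
    size = re.search(r"(\d+(?:\.\d+)?)\s*inch(?:es)?", text, re.IGNORECASE)
    res = re.search(r"(\d{3,5})\s*[x×]\s*(\d{3,5})", text, re.IGNORECASE)
    return {
        "display_size_inch": float(size.group(1)) if size else None,
        "display_width_px": int(res.group(1)) if res else None,
        "display_height_px": int(res.group(2)) if res else None,
    }
```

Missing parts come back as `None`, so they can be handled by the same missing-value logic as the other columns.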

4️⃣ Exploratory Data Analysis → `notebooks/02_eda.ipynb`

Interactive visualizations to understand:

Price distribution across brands
CPU/GPU market share
Device category analysis
RAM and storage trends
Brand positioning in market segments
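
As a starting point for the first question, the sketch below computes the median price per brand using only the standard library. The notebook itself presumably works with pandas and plots; `load_rows` and `median_price_by_brand` are hypothetical helpers, and the column names `brand` and `price` are taken from the feature table.

```python
import csv
from collections import defaultdict
from statistics import median


def load_rows(path="data/processed/laptops_cleaned.csv"):
    """Read the cleaned CSV into a list of dicts (one per laptop)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def median_price_by_brand(rows):
    """Group rows by brand and return {brand: median price}, sorted by brand."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["brand"]].append(float(row["price"]))
    return {brand: median(values) for brand, values in sorted(prices.items())}
```

Calling `median_price_by_brand(load_rows())` from the repository root would give one number per brand. The median is used rather than the mean because a few very expensive models can skew averages.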

📊 Final Dataset Highlights

Dataset Statistics

992 laptop models (cleaned and validated)
27 engineered features (from raw columns)
15+ brands (Lenovo, HP, Dell, Asus, Acer, Apple, etc.)
Price range: ₹12,490 - ₹605,990

Feature Categories

| Category | Features |
|---|---|
| Basic Info | `brand`, `model`, `device_category`, `price`, `rating` |
| CPU Details | `cpu_brand`, `cpu_family`, `cpu_series`, `cpu_model`, `cpu_suffix`, `cpu_core_count`, `cpu_thread_count`, `cpu_p_cores`, `cpu_e_cores`, `cpu_lp_e_cores` |
| Memory | `ram_gb`, `storage_gb` |
| Display | `display_width_px`, `display_height_px`, `display_size_inch` |
| GPU Details | `gpu_brand`, `gpu_series`, `gpu_model`, `gpu_vram_gb`, `gpu_type` |
| OS & Warranty | `os_name`, `warranty_years` |

Market Statistics

Intel: 71% of CPUs in the dataset
AMD: 23% of CPUs in the dataset
NVIDIA: 29% of GPUs in the dataset
Integrated GPUs: 70% of laptops
Windows: 94% of operating systems

Data Quality Improvements

✅ Extracted brand from model names
✅ Parsed complex processor names (Intel Core Ultra, AMD Ryzen AI, Apple M-series)
✅ Extracted numeric core counts from text (e.g., "20 Cores (8P + 12E)")
✅ Classified GPU as integrated vs dedicated
✅ Converted TB storage to GB
✅ Standardized OS names
✅ Handled missing warranty values
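
The core-count extraction mentioned above can be illustrated with the example string "20 Cores (8P + 12E)". The sketch below is an assumption about how such text might be parsed, not the notebook's code. In particular, the "2LP E" notation for low-power efficiency cores is a guess, and non-hybrid CPUs simply get zeros for the P/E fields.

```python
import re


def parse_cores(text):
    """Parse strings like '20 Cores (8P + 12E)' into core-count columns."""
    total = re.search(r"(\d+)\s*Cores?", text, re.IGNORECASE)
    if not total:
        return None

    def count(pattern):
        # Return the first captured number for the pattern, or 0 if absent.
        m = re.search(pattern, text)
        return int(m.group(1)) if m else 0

    return {
        "cpu_core_count": int(total.group(1)),
        "cpu_p_cores": count(r"(\d+)\s*P\b"),
        "cpu_e_cores": count(r"(\d+)\s*E\b"),
        "cpu_lp_e_cores": count(r"(\d+)\s*LP\s*-?\s*E\b"),
    }
```

A useful sanity check on hybrid CPUs is that the P, E, and LP E counts add up to `cpu_core_count`.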

🚀 Potential Use Cases

Machine Learning

Price prediction using regression models
Feature importance analysis for pricing strategy
Brand positioning clustering
Value-for-money recommendation systems

Data Analysis

Market trend analysis across brands
Hardware spec evolution over time (requires launch dates, which the dataset does not yet include)
CPU/GPU performance vs price analysis
Gaming laptop vs business laptop comparison

Portfolio Projects

EDA dashboards (Plotly, Streamlit, Dash)
Interactive price prediction apps
Teaching real-world data cleaning workflows
Practice dataset for beginners

🔗 Kaggle Dataset

The cleaned dataset is published on Kaggle:

👉 Laptops Cleaned Dataset

🛠️ Tech Stack & Dependencies

Core Libraries

pandas          # Data manipulation
numpy           # Numerical operations
jupyter         # Interactive notebooks
regex           # Pattern matching for CPU/GPU parsing

Installation

# Clone the repository
git clone https://github.com/abhinavflac/laptops-specs-dataset.git
cd laptops-specs-dataset

# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the notebooks
jupyter notebook notebooks/

🧠 What This Project Demonstrates

Technical Skills

✅ Web Scraping - Selenium automation with infinite scroll handling
✅ Data Cleaning - 200+ lines of cleaning logic
✅ Regex Mastery - Complex pattern extraction for CPU/GPU specs
✅ Feature Engineering - Text to structured data transformation
✅ Data Validation - Business rule implementation for quality assurance
✅ ETL Pipeline - End-to-end reproducible workflow

Data Science Best Practices

✅ Reproducible pipeline with clear documentation
✅ Raw data preservation (never modify source data)
✅ Systematic missing value handling
✅ Feature engineering guided by domain knowledge
✅ Data quality validation at each stage

This repository reflects practical data science work, where most of the effort (often quoted as 80%) goes into understanding and preparing the data correctly rather than into model building.

📜 License

This project is released under the CC0 (Public Domain) license.
You are free to use, modify, and distribute the data and code without restriction.

🤝 Contributions & Feedback

Suggestions, issues, or improvements are always welcome!
If you use this dataset in a project, feel free to share your work — I'd love to see it.

Possible Enhancements

Add more brands (international markets)
Include battery specifications
Add weight and dimensions
Include launch dates for trend analysis
Expand connectivity features (Wi-Fi, Bluetooth versions)

📧 Contact

For questions or collaboration:

Kaggle: githubmasterin
GitHub: @abhinavflac

⭐ If this project helped you, consider giving it a star!
