Laptops Cleaned Dataset (India, 2025)

💻 Project Overview

This repository contains the complete workflow behind building a clean, structured, and analysis-ready laptop dataset for the Indian market.
The project starts from raw scraped HTML and unprocessed data and ends with a fully curated dataset of 992 laptops, suitable for data analysis, machine learning, and price prediction.

The goal of this project is to demonstrate real-world data engineering and data cleaning skills, with special emphasis on parsing complex CPU and GPU specifications.

📋 What This Project Solves

Public laptop data available online is often:

  • Unstructured with complex processor naming conventions (Intel Core Ultra, AMD Ryzen AI, Apple M-series)
  • Missing standardized GPU information (integrated vs dedicated, VRAM)
  • Difficult to parse for hybrid-core CPU architectures (P-cores, E-cores, LP E-cores)
  • Hard to use directly for analysis or machine learning

This project transforms that raw data into a reliable, machine-readable dataset by applying systematic cleaning, validation, and advanced feature engineering techniques.

📂 Repository Structure

laptops-dataset/
├── LICENSE
├── README.md
├── requirements.txt
│
├── data/
│   ├── raw/
│   │   └── laptops_raw.csv          # Original scraped data
│   └── processed/
│       └── laptops_cleaned.csv      # Cleaned dataset (992 laptops)
│
├── images/                           # Visualizations & assets
│
├── notebooks/
│   ├── 01_data_cleaning.ipynb       # Data cleaning & feature engineering
│   └── 02_eda.ipynb                 # Exploratory data analysis
│
├── scraped/
│   └── smartprix_laptops.html       # Raw scraped HTML
│
└── scripts/
    └── scrape.py                    # Selenium-based scraper

🧩 Data Pipeline Explained

1️⃣ Data Collection → scripts/scrape.py

  • Scraped raw laptop listings from Smartprix.com (Indian marketplace)
  • Used Selenium with Chrome WebDriver to handle dynamically loaded pages
  • Automated infinite scrolling to load all laptop listings
  • Collected 1,000+ laptop entries with raw specifications

Technology Stack:

  • selenium - Dynamic web scraping
  • Infinite scroll handling
  • HTML saved for reproducibility

Files:
scripts/scrape.py → scraped/smartprix_laptops.html
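
The infinite-scroll step can be sketched as follows. This is a minimal illustration, not the actual contents of scripts/scrape.py: the function name `scroll_until_stable`, the pause length, and the round limit are assumptions. It relies only on Selenium's `execute_script` method, so any WebDriver instance can be passed in.

```python
import time


def scroll_until_stable(driver, pause_s=2.0, max_rounds=200):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_s)  # give the page time to fetch the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the list is complete
        last_height = new_height
        rounds += 1
    return rounds


# Hypothetical usage (requires selenium and a Chrome driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(listing_url)  # listing_url is a placeholder, not given here
#   scroll_until_stable(driver)
#   with open("scraped/smartprix_laptops.html", "w", encoding="utf-8") as f:
#       f.write(driver.page_source)
```

Comparing page heights is a simple stopping rule. A page with a "load more" button, or one that loads content slowly, would need an explicit click or a longer pause instead.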

2️⃣ Data Cleaning → notebooks/01_data_cleaning.ipynb

This is the most intensive phase, with 200+ lines of cleaning logic.

Key Operations:

  • Removed currency symbols from prices
  • Extracted brand from model names
  • Cleaned RAM/Storage values (converted TB to GB)
  • Handled missing values systematically
  • Standardized OS names
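
Two of these operations, stripping currency symbols and converting TB to GB, can be sketched with the standard `re` module. The helper names and the raw string formats (such as "₹54,990" or "1 TB SSD") are assumptions for illustration, and the sketch assumes 1 TB = 1024 GB; the notebook may use a different convention.

```python
import re


def parse_price_inr(text):
    """Return the price as an int, e.g. '₹54,990' -> 54990, or None."""
    match = re.search(r"\d[\d,]*", text)
    if not match:
        return None
    return int(match.group(0).replace(",", ""))


def parse_capacity_gb(text):
    """Return a RAM/storage size in GB, e.g. '1 TB SSD' -> 1024, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(TB|GB)", text, flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1))
    unit = match.group(2).upper()
    # Assumption: binary conversion (1 TB = 1024 GB).
    return int(value * 1024) if unit == "TB" else int(value)
```

With pandas, helpers like these could be applied to a whole column with `Series.map`.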

Validation Rules Applied:

  • Valid CPU required (Intel, AMD, Apple, Qualcomm)
  • Price range validation
  • Display size validation
  • Warranty data normalization
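
The first three rules can be expressed as a small row-level check. This is a sketch under stated assumptions: the numeric bounds below are hypothetical placeholders (the README does not give the thresholds that were used), and `validation_errors` is an illustrative name. The column names come from the feature list in this README.

```python
VALID_CPU_BRANDS = {"Intel", "AMD", "Apple", "Qualcomm"}

# Hypothetical bounds, chosen only for illustration.
PRICE_RANGE_INR = (5_000, 1_000_000)
DISPLAY_RANGE_INCH = (10.0, 20.0)


def _in_range(value, bounds):
    """True when value is numeric and lies within the inclusive bounds."""
    low, high = bounds
    return isinstance(value, (int, float)) and low <= value <= high


def validation_errors(row):
    """Return the names of the fields in `row` (a dict) that fail a rule."""
    errors = []
    if row.get("cpu_brand") not in VALID_CPU_BRANDS:
        errors.append("cpu_brand")
    if not _in_range(row.get("price"), PRICE_RANGE_INR):
        errors.append("price")
    if not _in_range(row.get("display_size_inch"), DISPLAY_RANGE_INCH):
        errors.append("display_size_inch")
    return errors
```

Returning the failing field names, rather than a single boolean, makes it easier to see which rule removed a given listing.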

Files:
data/raw/laptops_raw.csv → notebooks/01_data_cleaning.ipynb → data/processed/laptops_cleaned.csv (992 rows)

3️⃣ Feature Engineering → notebooks/01_data_cleaning.ipynb

Transformed messy text columns into 27 well-defined features:

CPU Parsing (Complex):

  • Brand: Intel, AMD, Apple, Qualcomm
  • Family: Core, Ryzen, M-Series, Celeron
  • Series: Core i5, Ryzen 7, Core Ultra 7, M4 Pro
  • Model: 13420H, 7530U, M4 Pro
  • Suffix: U, H, HX, Pro, Max
  • Core Details: cpu_core_count, cpu_thread_count, cpu_p_cores, cpu_e_cores, cpu_lp_e_cores
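
As a rough illustration of how a processor name splits into these fields, the sketch below handles only the simplest name shapes (for example "Intel Core i5 13420H", "AMD Ryzen 7 7530U", "Apple M4 Pro"). The patterns and the function name `parse_cpu_name` are assumptions, not the notebook's actual regexes, and names such as Ryzen AI or Celeron are deliberately left unmatched.

```python
import re

# Intel Core / AMD Ryzen names: brand, series (family + tier), model number.
_X86 = re.compile(
    r"^(?P<brand>Intel|AMD)\s+"
    r"(?P<series>(?P<family>Core|Ryzen)(?:\s+Ultra)?\s+i?\d)\s+"
    r"(?P<model>\d{3,5}(?P<suffix>[A-Z]{1,2})?)\b"
)
# Apple M-series names: "M4", "M4 Pro", "M4 Max".
_APPLE = re.compile(r"^Apple\s+(?P<model>M\d(?:\s+(?P<suffix>Pro|Max))?)\b")


def parse_cpu_name(name):
    """Split a CPU name into structured fields, or return None if unmatched."""
    name = " ".join(name.split())  # collapse repeated whitespace
    m = _X86.match(name)
    if m:
        return {
            "cpu_brand": m["brand"],
            "cpu_family": m["family"],
            "cpu_series": m["series"],
            "cpu_model": m["model"],
            "cpu_suffix": m["suffix"],  # None when the model has no suffix
        }
    m = _APPLE.match(name)
    if m:
        return {
            "cpu_brand": "Apple",
            "cpu_family": "M-Series",
            "cpu_series": m["model"],
            "cpu_model": m["model"],
            "cpu_suffix": m["suffix"],
        }
    return None
```

Returning `None` for unmatched names keeps unparsed rows visible, so they can be reviewed rather than silently assigned wrong values.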

GPU Parsing:

  • Brand: NVIDIA, AMD, Intel, Apple, Qualcomm
  • Series: GeForce RTX, Radeon, Iris Xe, Apple GPU
  • Model: RTX 4060, Radeon Graphics, Apple 10-Core GPU
  • VRAM: Dedicated GPU memory in GB
  • Type: Integrated vs Dedicated classification
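
One possible heuristic for the VRAM and type fields is sketched below. It assumes that a GPU string mentioning a memory size in GB, or one of a few well-known discrete product lines, is dedicated, and that everything else is integrated. The keyword list and the function name `parse_gpu` are illustrative assumptions, not the rules used in the notebook.

```python
import re

_VRAM = re.compile(r"(\d+)\s*GB", re.IGNORECASE)
# Assumed markers of discrete GPUs; extend as needed for other product lines.
_DEDICATED_HINTS = ("geforce", "rtx", "gtx", "radeon rx")


def parse_gpu(text):
    """Return gpu_type and gpu_vram_gb extracted from a raw GPU string."""
    lowered = text.lower()
    vram = _VRAM.search(text)
    dedicated = vram is not None or any(h in lowered for h in _DEDICATED_HINTS)
    return {
        "gpu_type": "Dedicated" if dedicated else "Integrated",
        "gpu_vram_gb": int(vram.group(1)) if vram else None,
    }
```

A keyword heuristic like this misclassifies strings that report shared memory for an integrated GPU, so spot-checking the results against known models is worthwhile.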

Other Features:

  • Display: display_width_px, display_height_px, display_size_inch
  • Memory: ram_gb, storage_gb
  • System: os_name, warranty_years
  • Category: device_category (Gaming, Business, Thin & Light, General)
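
The three display columns can be derived from a single raw string along these lines. The input format ("15.6 inches, 1920 x 1080 pixels") and the function name `parse_display` are assumptions made for this sketch.

```python
import re

_RESOLUTION = re.compile(r"(\d{3,5})\s*[x×]\s*(\d{3,5})", re.IGNORECASE)
_SIZE = re.compile(r'(\d{1,2}(?:\.\d+)?)\s*(?:inch(?:es)?|")', re.IGNORECASE)


def parse_display(text):
    """Extract width, height (pixels) and diagonal size (inches)."""
    res = _RESOLUTION.search(text)
    size = _SIZE.search(text)
    return {
        "display_width_px": int(res.group(1)) if res else None,
        "display_height_px": int(res.group(2)) if res else None,
        "display_size_inch": float(size.group(1)) if size else None,
    }
```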

4️⃣ Exploratory Data Analysis → notebooks/02_eda.ipynb

Interactive visualizations to understand:

  • Price distribution across brands
  • CPU/GPU market share
  • Device category analysis
  • RAM and storage trends
  • Brand positioning in market segments

📊 Final Dataset Highlights

Dataset Statistics

  • 992 laptop models (cleaned and validated)
  • 27 engineered features (from raw columns)
  • 15+ brands (Lenovo, HP, Dell, Asus, Acer, Apple, etc.)
  • Price range: ₹12,490 - ₹605,990

Feature Categories

| Category | Features |
| --- | --- |
| Basic Info | brand, model, device_category, price, rating |
| CPU Details | cpu_brand, cpu_family, cpu_series, cpu_model, cpu_suffix, cpu_core_count, cpu_thread_count, cpu_p_cores, cpu_e_cores, cpu_lp_e_cores |
| Memory | ram_gb, storage_gb |
| Display | display_width_px, display_height_px, display_size_inch |
| GPU Details | gpu_brand, gpu_series, gpu_model, gpu_vram_gb, gpu_type |
| OS & Warranty | os_name, warranty_years |

Market Statistics

  • Intel: 71% CPU market share
  • AMD: 23% CPU market share
  • NVIDIA: 29% of GPUs
  • Integrated GPUs: 70% of laptops
  • Windows: 94% of operating systems
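
Share figures like these can be recomputed from the cleaned CSV. The sketch below uses only the standard library and assumes the CSV headers match the feature names listed above; `share_percent` is an illustrative helper, not code from the notebooks.

```python
import csv
from collections import Counter


def share_percent(rows, column):
    """Percentage share of each non-empty value in `column`, largest first."""
    counts = Counter(row[column] for row in rows if row.get(column))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: round(100 * n / total, 1) for value, n in counts.most_common()}


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical usage from the repository root:
#   rows = load_rows("data/processed/laptops_cleaned.csv")
#   print(share_percent(rows, "cpu_brand"))
#   print(share_percent(rows, "gpu_type"))
```

With pandas, `Series.value_counts(normalize=True)` gives the same proportions in one call.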

Data Quality Improvements

✅ Extracted brand from model names
✅ Parsed complex processor names (Intel Core Ultra, AMD Ryzen AI, Apple M-series)
✅ Extracted numeric core counts from text (e.g., "20 Cores (8P + 12E)")
✅ Classified GPU as integrated vs dedicated
✅ Converted TB storage to GB
✅ Standardized OS names
✅ Handled missing warranty values
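
The core-count extraction mentioned above, using the "20 Cores (8P + 12E)" example, could look roughly like this. The patterns are assumptions about how the raw text is written (in particular the "LP E" spelling for low-power efficiency cores), and `parse_core_details` is an illustrative name.

```python
import re

_TOTAL = re.compile(r"(\d+)\s*Cores?", re.IGNORECASE)
# Matches "8P", "12E", and (assumed spelling) "2 LP E" or "2LPE".
_PART = re.compile(r"(\d+)\s*(LP\s*E|P|E)\b", re.IGNORECASE)

_KEY_FOR = {"P": "cpu_p_cores", "E": "cpu_e_cores", "LPE": "cpu_lp_e_cores"}


def parse_core_details(text):
    """Extract total, P-, E- and LP E-core counts from a spec string."""
    total = _TOTAL.search(text)
    result = {
        "cpu_core_count": int(total.group(1)) if total else None,
        "cpu_p_cores": None,
        "cpu_e_cores": None,
        "cpu_lp_e_cores": None,
    }
    for count, kind in _PART.findall(text):
        kind = re.sub(r"\s+", "", kind).upper()  # "LP E" -> "LPE"
        result[_KEY_FOR[kind]] = int(count)
    return result
```

For non-hybrid CPUs the per-type fields stay `None`, which distinguishes "not applicable" from a count of zero.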

🚀 Potential Use Cases

Machine Learning

  • Price prediction using regression models
  • Feature importance analysis for pricing strategy
  • Brand positioning clustering
  • Value-for-money recommendation systems

Data Analysis

  • Market trend analysis across brands
  • Hardware spec evolution over time
  • CPU/GPU performance vs price analysis
  • Gaming laptop vs business laptop comparison

Portfolio Projects

  • EDA dashboards (Plotly, Streamlit, Dash)
  • Interactive price prediction apps
  • Teaching real-world data cleaning workflows
  • Practice dataset for beginners

🔗 Kaggle Dataset

The cleaned dataset is published on Kaggle:

👉 Laptops Cleaned Dataset

🛠️ Tech Stack & Dependencies

Core Libraries

pandas          # Data manipulation
numpy           # Numerical operations
jupyter         # Interactive notebooks
regex           # Pattern matching for CPU/GPU parsing

Installation

# Clone the repository
git clone https://github.com/abhinavflac/laptops-specs-dataset.git
cd laptops-specs-dataset

# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the notebooks
jupyter notebook notebooks/

🧠 What This Project Demonstrates

Technical Skills

✅ Web Scraping - Selenium automation with infinite scroll handling
✅ Data Cleaning - 200+ lines of cleaning logic
✅ Regex Mastery - Complex pattern extraction for CPU/GPU specs
✅ Feature Engineering - Text to structured data transformation
✅ Data Validation - Business rule implementation for quality assurance
✅ ETL Pipeline - End-to-end reproducible workflow

Data Science Best Practices

✅ Reproducible pipeline with clear documentation
✅ Raw data preservation (never modify source data)
✅ Systematic missing value handling
✅ Feature engineering guided by domain knowledge
✅ Data quality validation at each stage

This repository reflects practical data science work, where most of the effort (commonly estimated at around 80%) goes into understanding and preparing the data correctly rather than into model building.

📜 License

This project is released under the CC0 (Public Domain) license.
You are free to use, modify, and distribute the data and code without restriction.

🤝 Contributions & Feedback

Suggestions, issues, or improvements are always welcome!
If you use this dataset in a project, feel free to share your work; I'd love to see it.

Possible Enhancements

  • Add more brands (international markets)
  • Include battery specifications
  • Add weight and dimensions
  • Include launch dates for trend analysis
  • Expand connectivity features (WiFi, Bluetooth versions)

📧 Contact

For questions or collaboration:

⭐ If this project helped you, consider giving it a star!
