InnovaLab 2025 - ML Hackathon - Lakitus

Car Quality Predictor by Lakitus

App Link: Streamlit App by Lakitus
GitHub repository: car-quirks-ml-InnovaLab25

Abstract

This project was presented at InnovaLab 2025, a machine learning hackathon whose challenge was to predict a vehicle's quality from its characteristics. The organizers provided a CSV file with the necessary data, which we used to build and train our model. We used Python and machine learning tools focused on model generation and training. The result was a robust model that classifies a vehicle's quality as "High," "Medium," or "Low."

Purpose

In this hackathon, our objective was to build a machine learning model capable of predicting a car's quality category ("Low," "Medium," or "High") based on its specifications. We received a dataset of 10,000 vehicles, each described by characteristics such as year of manufacture, mileage, fuel type, engine power, safety rating, and fuel efficiency. The target variable, car_quality, was unbalanced (approximately 84% "Medium," 10.5% "High," and 5.5% "Low"), which required careful handling during training and evaluation. Our motivation was twofold: first, to demonstrate how predictive analytics can help buyers and dealers quickly assess vehicle quality; and second, to offer an interactive and explainable web application where users can enter car specifications, obtain an immediate quality prediction, and explore the model's reasoning using SHAP values and what-if sliders.

Figure: Lakitu Stack

Data Loading and Exploration

We began by inspecting the provided CSV file, which contained 10,000 rows and the following fields: name, year, selling_price, km_driven, fuel, estimated_fuel_l, seller_type, transmission, owner, body_type, engine_power_hp, safety_level, car_quality, efficiency_km_l, and quality_score. Since there were no missing values, we initially focused on understanding the distributions of and relationships between these variables.

This exploratory step confirmed data cleanliness, highlighted class imbalance, and guided our decision to design additional features (such as extracting a brand name) and to discard quality_score to prevent label leakage.
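The exploration step can be sketched roughly as follows. A small synthetic sample stands in for the real hackathon CSV (column names come from the write-up; the values are invented for illustration):

```python
# Sketch of the exploration step on a tiny synthetic sample mimicking the
# hackathon schema (values invented; the real file had 10,000 rows).
import pandas as pd

df = pd.DataFrame({
    "name": ["Maruti Swift VXI", "Hyundai i20 Sportz", "Honda City ZX", "Maruti Alto LXI"],
    "year": [2018, 2020, 2016, 2014],
    "km_driven": [45000, 20000, 80000, 120000],
    "fuel": ["Petrol", "Petrol", "Diesel", "Petrol"],
    "car_quality": ["Medium", "High", "Medium", "Low"],
})

# Confirm there are no missing values (true of the real dataset as well).
assert df.isna().sum().sum() == 0

# Class distribution of the target -- this is where the imbalance shows up.
dist = df["car_quality"].value_counts(normalize=True)
print(dist)
```

On the real data, the same `value_counts(normalize=True)` call is what surfaces the roughly 84/10.5/5.5 split mentioned above.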

Data Preprocessing

Before training any models, we cleaned and transformed the raw fields into features that the pipeline could consume.
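A minimal sketch of this stage, assuming a scikit-learn `ColumnTransformer` pipeline (the exact column lists and encoder choices here are illustrative, based on the fields named earlier): extract a brand feature from `name`, drop `quality_score` to avoid label leakage, and encode numeric and categorical columns separately.

```python
# Hedged preprocessing sketch: brand extraction, leakage removal, and a
# ColumnTransformer wrapping the encoders. Column lists are assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "name": ["Maruti Swift VXI", "Hyundai i20 Sportz", "Honda City ZX"],
    "year": [2018, 2020, 2016],
    "km_driven": [45000, 20000, 80000],
    "fuel": ["Petrol", "Petrol", "Diesel"],
    "transmission": ["Manual", "Automatic", "Manual"],
    "quality_score": [0.61, 0.88, 0.55],  # dropped below to prevent leakage
})

# Feature engineering: the first token of `name` serves as the brand.
df["brand"] = df["name"].str.split().str[0]
X = df.drop(columns=["name", "quality_score"])

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["year", "km_driven"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel", "transmission", "brand"]),
])
Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

`handle_unknown="ignore"` matters for the deployed app, where a user could enter a brand the training data never saw.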

Model Selection

We evaluated several classifiers to find the best balance between accuracy and interpretability.

Overall, XGBoost, after tuning, offered the best balance between speed, accuracy, and robustness, so it was selected as the final model for our Streamlit application.
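The training setup around the chosen model can be sketched as below. The final model was a tuned XGBoost classifier; here scikit-learn's `RandomForestClassifier` stands in so the example is self-contained, but the stratified split and class weighting shown apply equally to handling the 84/10.5/5.5 imbalance described earlier.

```python
# Sketch of the imbalance-aware training setup (RandomForest as a stand-in
# for the tuned XGBoost final model; data is synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 5))
# Imbalanced target roughly mirroring the dataset's class proportions.
y = rng.choice(["Medium", "High", "Low"], size=n, p=[0.84, 0.105, 0.055])

# Stratify so the rare "High" and "Low" classes appear in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" counters the skew toward "Medium"; macro-F1
# weights all three classes equally, unlike plain accuracy.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=42).fit(X_tr, y_tr)
print(f1_score(y_te, clf.predict(X_te), average="macro"))
```

Evaluating with macro-F1 rather than accuracy is what keeps a model from looking good simply by predicting "Medium" everywhere.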

Model Evaluation and Interpretation

We used two approaches to understand how our XGBoost model makes decisions: SHAP value analysis and interactive what-if exploration.

Web Deployment with Streamlit

We created a single-file Streamlit application (app.py) that integrates our XGBoost pipeline into an interactive interface.

In short, the Streamlit app runs locally or on Streamlit Community Cloud as a fully interactive dashboard that predicts car quality, explains the prediction using SHAP, and allows users to explore what-if scenarios.

Final Takeaways

Thank you for reading! Keep in touch for similar projects.