
Project 1: Large Retail Sales Dataset Analysis
Project Background
​
The following is an analysis project undertaken based on retail transactional data that encompasses:
-
301,010 transactions
-
159,335 customers (ranging from 18-70 years of age; 62% male, 38% female)
-
Five (5) countries (USA, Germany, Australia, UK and Canada)​
​
This analysis was conducted by myself individually on the below dataset:
https://www.kaggle.com/datasets/sahilprajapati143/retail-analysis-large-dataset/data
​
This dataset contains numerous data points from:
-
customer demographic information (name, email address, country, phone number address, gender etc) to
-
purchasing information (order quantity and dollar figure totals) through to
-
product category and transaction information
​​
The data structure is defined in more depth below.
This large dataset was not without its errors; the pre-uploaded dataset to MySQL had numerous NULL values and blank fields, inconsistencies in column headers as well as errors in the data fields themselves more generally and formatting issues that I have addressed with the use of Excel and SQL queries to correct and standardise.
​
The project was conducted in two (2) phases:
-
Data Cleaning: Conducted various cleaning and standardisation tasks in Excel and SQL to ensure the best possible accuracy, authenticity and reliability of the analysis, in order to best practically apply its insights.
-
Exploratory Data Analysis (EDA): This phase uses specific SQL Queries and Tableau visualisation to take a deeper dive into what insights can be gleaned from the data in alignment with business metrics from a sales, customer and product standpoint.
The intention is to provide sales professionals/managers and/or employees operating within a retail context, actionable recommendations that they can use to focus their efforts to be more profitable.
Please see this repository for a detailed list of data cleaning steps taken and SQL reporting queries/scripts built: ShaneHoffman_GlobalRetail_DataAnalysisProject
​​
A pdf copy of the report including the SQL scripts and corresponding reporting can also be downloaded here: Global Retail Sales Final Report
The analysis is focused on the following key areas:

1
Sales Performance Insights
-
What are the top revenue generating products?
​
-
Which countries have the highest numbers of purchases?
-
What is the average order value (AOV)?
-
What is the average order quantity?
2
Customer Insights
-
Do customers follow a particular: gender, age and/or income pattern or trend?
-
What is the percentage split of customers by gender, income level?
-
What about the average age of customers?
-
How much revenue is driven by repeat vs one-off/new customers?
3
Product (Market) Trends
-
Which products are underperforming?
-
Which product categories are performing? and which are not?
-
Are there particular brands performing better/worse than others?
Key Insights Obtained
Sales Performance Insights
-
Bottled water products make up over 80% of the top 20 best-selling products in 2023 and 2024
-
The next highest selling product not in the bottled water grocery category has over 65% less spend than its preceding bottled water product on the top 20 rankings for 2024 and over 76% less spend that it’s bottled water predecessor in 2024!
-
From an Australian perspective, products are more evenly split however, bottled water products still make up over 32% of the top 20 revenue generating products in the country ​​​
​
On a mobile device view dashboard here: Sales Insights
Customer Insights
-
55% of the top 20 (and 43% of overall) spending customers are medium income earners
-
Customers in the age ranges of 18-25 and 41-70 represent 37% and 37% = 74% of the overall customers
-
2 out of every 3 customers do not repeat purchase
​
On a mobile device view the dashboard here: Customer Insights
Product (Market) Trends​
-
As a brand, Pepsi is an outlier, most pronounced in a USA context, being the:
-
Highest selling brand per units sold: 172% higher than the next most purchased brand by units sold
-
Highest revenue generating brand: 168% higher than the next highest brand
-
-
Products in the grocery and electronics categories together make up 60% of the top 20 revenue generating products
-
Products with less than 10 units sold and less than 5 customer rating tend to be very niche products such as very specific decorations, types of coffee and very specific types of clothing​
​
On a mobile device view the dashboard here: Product Insights
Business Recommendations
The analysis of this data brings with it many new actionable suggestions and strategies that this analyst
can now make to businesses either in, or adjacent to, other businesses in the retail industry. These
include:
-
Doubling down/focusing on popular brands such as Pepsi, Whirlpool, Coca-Cola and Adidas further investing in holding more stock of these items due to their clear, outlying popularity
-
Utilising the popularity of these brands in cross sell campaigns with other less popular brands
selling lesser popular brands such as Nestle by doing targeted cross sell campaigns offeringFor example: US retailers could experiment with leveraging the popularity of Pepsi as a brand in to increase the less popular brands’ sell through rates. -
Pepsi products bundled with Nestle products and trial these campaigns for a measurable period, to evaluate their effectiveness and whether they will be continued
-
Focusing marketing efforts on male, medium income earners in the 18-25 and 41-70 age
brackets as they represent the majority of customers -
Or conversely tailoring marketing efforts to sell more to less represented markets (in this
reporting) such as females and the 26–40 year age range -
Given that 2 out of every 3 customers is just a once-off buyer – retailers should also focus on
campaigns around repeat purchasing e.g. offering rewards/incentives for repeat purchases in
follow up transactions -
Retailers can also re-evaluate those products with less than 10 units sold and less than 5/10
customer rating e.g. those niche products such as very specific decorations, types of coffee
and very specific types of clothing – and evaluate whether they can discontinue these items and
reduce costs in holding inventory around these that can be freed up for re-investment back into the business or to focus on more popular products