Pacific Business Review (International)

A Refereed Monthly International Journal of Management Indexed With Web of Science (ESCI)
ISSN: 0974-438X
Impact factor (SJIF):8.603
RNI No.:RAJENG/2016/70346
Postal Reg. No.: RJ/UD/29-136/2017-2019
Editorial Board

Prof. B. P. Sharma
(Principal Editor in Chief)

Prof. Dipin Mathur
(Consultative Editor)

Dr. Khushbu Agarwal
(Editor in Chief)

Editorial Team


Data Analytics Sales Prediction Model


Avinash Dangwani

PhD Scholar

Department of Engineering (Computers)

Pacific University

Airport Road, Debari, Udaipur (Rajasthan)



Dr. Chandansingh Rawat

Associate Professor

Dept of Electronics & Telecommunication Engineering

VESIT HAMC Collectors Colony,

Chembur Mumbai





At the start of every new financial year, companies propose an advertisement budget to improve their sales. Estimating the advertisement budget is not an easy task, as it involves financial parameters. Managers are always interested in a prediction model for sales as a function of advertisement expenses.

This paper develops a sales prediction model using simple linear regression. The model is built using the training dataset to estimate the regression parameters. The method of Ordinary Least Squares (OLS) is used to estimate the regression parameters using Python. The regression model is validated to ensure goodness of fit before it can be used for practical application. The use of a single variable is the limitation of this model. In future, multiple variables can be handled using a multiple linear regression model in Python.



Simple linear regression, Ordinary Least Squares (OLS), Training & validation data, Sum square regression (SSR), Total sum of squares (SST).






This paper develops a sales prediction model using simple linear regression. Sales prediction has two main classes of methods: (1) qualitative methods and (2) quantitative methods [3]. Some of the qualitative methods are Expert's Opinion Method, Sales Force Composite Method, Survey of Buyer's Expectations, Historical Analogy Method, Jury of Executive Opinions & Leading Indicators Method.

Some of the quantitative methods are Test Marketing, Time Series Analysis, Moving Average Method, Exponential Smoothing Method, Regression Analysis & Econometric Models.

This paper explores sales prediction using regression analysis because of its lower time complexity compared to some other algorithms. Furthermore, these models can be trained easily and efficiently even on systems with relatively low computational power, when compared to more complex algorithms. Building a regression model is an iterative process, and several iterations may be required before finalizing the appropriate model [2]. The paper is organized in the following sections.

  • Section – I: Simple Linear regression
  • Section – II: Ordinary Least Squares (OLS) Method
  • Section – III: Results & Model Diagnostics
  • Section – IV: Conclusion


Simple Linear Regression


Simple linear regression (SLR) is a statistical technique that models the association between a dependent variable (outcome variable) and a single independent variable (predictor/feature variable).

The functional form of SLR is as follows:


                                      Yi = β0 + β1 Xi + εi                                                                     (1)




where

Yi = value of the ith observation of the dependent variable

Xi = value of the ith observation of the independent variable

εi = random error (residual) in predicting the value of Yi

β0 & β1 = regression parameters
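
To make equation (1) concrete, the following minimal sketch generates synthetic observations from it. The function name `simulate_slr`, the parameter values and the noise level are illustrative assumptions and are not taken from the paper's data.

```python
import random


def simulate_slr(xs, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * X_i + eps_i with Gaussian noise eps_i."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical inputs, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate_slr(xs, beta0=6.0, beta1=0.2, sigma=0.5)
for x, y in zip(xs, ys):
    print(f"X = {x:.1f}  Y = {y:.3f}")
```

Setting `sigma=0.0` removes the error term, so every generated point lies exactly on the line β0 + β1 X.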

Ordinary least square (OLS) Method


Equation (1) can be rewritten as

εi = Yi - β0 - β1 Xi                                            (2)

The regression parameters β0 & β1 are estimated by minimizing the sum of squared errors (SSE):

$$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \qquad (3)$$
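
As an illustration of the sum of squared errors, the sketch below evaluates SSE for two candidate lines on a small made-up dataset. The data and the function name `sse` are assumptions for illustration only. The candidate with the smaller SSE fits the data better, and OLS looks for the pair (β0, β1) that makes this quantity as small as possible.

```python
def sse(xs, ys, beta0, beta1):
    """Sum of squared errors for the line y = beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))


# Made-up observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

print(sse(xs, ys, beta0=0.0, beta1=2.0))  # candidate line A
print(sse(xs, ys, beta0=1.0, beta1=1.0))  # candidate line B
```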

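The estimates of the regression parameters are obtained by taking the partial derivatives of SSE with respect to β0 & β1, setting them to zero, and solving the resulting equations. The estimated parameters are given by

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad (4)$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \qquad (5)$$

where $\hat{\beta}_1$ & $\hat{\beta}_0$ are the estimated values of the regression parameters β1 & β0, and $\bar{X}$, $\bar{Y}$ are the mean values of X & Y.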
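
The standard closed-form OLS estimates (slope equal to the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, intercept equal to the mean of Y minus the slope times the mean of X) can be computed directly. The sketch below uses plain Python so that each term is visible; the function name `ols_fit` and the data are illustrative assumptions, not the paper's code or dataset.

```python
def ols_fit(xs, ys):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = s_xy / s_xx                # slope
    beta0_hat = y_bar - beta1_hat * x_bar  # intercept
    return beta0_hat, beta1_hat


# Made-up observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
b0, b1 = ols_fit(xs, ys)
print(f"beta0_hat = {b0:.3f}, beta1_hat = {b1:.3f}")
```

For these four points the deviations give a slope of 9.5 / 5 = 1.9 and an intercept of 4.75 - 1.9 × 2.5 = 0.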


  1. Data Source

Sample data is taken from Advertising Ratios & Budgets, an annual report provided by Schonfeld & Associates, Inc. [6], which covers over 2,400 companies and 320 industries with information on fiscal 2018 and 2019 advertising spending and revenue.


For the OLS analysis, data for a total of 52 sample companies is taken from 12 different industries.


Table – 1 shows the sample percentage revenue growth & percentage advertisement growth for Electromedical & Electrotherapeutic Apparatus.


Table – 1: Data Source [6]: June 2020 sample data of Advertising Ratios & Budgets from Schonfeld-Associates-Inc-v417 of Market


We will develop a simple regression model to understand and predict percentage sales revenue growth from percentage advertisement growth.


  1. Creating Feature Set(X) and Outcome Variable(Y) Using Python

The OLS model takes two parameters, Y and X. In our example, percentage advertisement growth is X and percentage sales revenue growth is Y. We split the data set into two sets: a training set and a validation set. The training set is used to train the algorithm to predict the output. The validation set is used to test accuracy & efficiency.
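
A minimal sketch of building X and Y from tabular records is shown below. The field names (`ad_growth_pct`, `rev_growth_pct`) and all values are made up for illustration; they are not the paper's data or its actual column names.

```python
# Hypothetical records; the field names and values are made up for illustration.
records = [
    {"company": "A", "ad_growth_pct": 5.0, "rev_growth_pct": 7.2},
    {"company": "B", "ad_growth_pct": -2.0, "rev_growth_pct": 4.9},
    {"company": "C", "ad_growth_pct": 12.5, "rev_growth_pct": 9.1},
]

# Feature set X: percentage advertisement growth.
X = [row["ad_growth_pct"] for row in records]
# Outcome variable Y: percentage sales revenue growth.
Y = [row["rev_growth_pct"] for row in records]

print(X)
print(Y)
```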


  1. Python for Building Regression Model

Python is used as the tool for building the regression model for sales prediction. The statsmodels library is used in Python for building statistical models. The OLS (ordinary least squares) API available in statsmodels.api is used to estimate the parameters of the simple linear regression model.


  1. Splitting the Dataset into Training and Validation Set

The data is divided into two subsets: a training data set and a validation data set. The proportion of the training data set is usually between 70% and 80% of the data, and the remaining data is used for validation. We have taken train_size = 0.8, which implies that 80% of the data is used for training the model and the remaining 20% is used for validating the model. The records selected for the training and test sets are randomly sampled using Python functions, which return four variables as shown below.

  1. train_X = feature values of the training set
  2. train_Y = response values of the training set
  3. test_X = feature values of the test set
  4. test_Y = response values of the test set
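
The paper does not name the Python function used for the split, so the following is only a sketch of the idea using the standard library. The helper `split_train_test`, its `seed` argument and the sample data are hypothetical; the four return values follow the order listed above.

```python
import random


def split_train_test(X, Y, train_size=0.8, seed=42):
    """Randomly split paired observations into training and test sets."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # reproducible random order
    n_train = int(round(train_size * len(X)))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train_X = [X[i] for i in train_idx]
    train_Y = [Y[i] for i in train_idx]
    test_X = [X[i] for i in test_idx]
    test_Y = [Y[i] for i in test_idx]
    return train_X, train_Y, test_X, test_Y


# Made-up paired data: ten observations with Y = 2 * X.
X = list(range(10))
Y = [2 * x for x in X]
train_X, train_Y, test_X, test_Y = split_train_test(X, Y, train_size=0.8, seed=1)
print(len(train_X), len(test_X))
```

Shuffling indices rather than the values themselves keeps each X paired with its own Y after the split.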


  1. Finding estimated parameters


Regression parameters $\hat{\beta}_0$ & $\hat{\beta}_1$ are estimated from equations (4) & (5), using Python functions as the tool.


  1. Fitting the Model

Linear regression calculates an equation that minimizes the distance between the fitted line and all of the observed data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals. In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.


  1. Co-efficient of Determination (R-Squared / R2)

R-squared is a statistical measure of how close the data are to the fitted regression line. It is defined as

$$R^2 = 1 - \frac{SSR}{SST} \qquad (6)$$


SSR = the sum squared regression (SSR), which here is the sum of the squared residuals $e_i$. A residual is the difference between the observed value $y_i$ and the estimated value $\hat{y}_i$,

$$e_i = y_i - \hat{y}_i \qquad (7)$$

as shown below in Fig - 1.















Fig – 1: Residuals as function of Actual value & estimated value


$$SSR = \sum_{i=1}^{4} e_i^2 = e_1^2 + e_2^2 + e_3^2 + e_4^2 \qquad (8)$$

= sum of squared variations w.r.t. the estimated value

SST = the total sum of squares, i.e. the sum of the squared distances of the data from the mean (central tendency), as shown below in Fig - 2.














Fig – 2: Residuals as function of Actual value & Mean value


$$SST = \sum_{i=1}^{4} (y_i - \bar{y})^2 = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + (y_3 - \bar{y})^2 + (y_4 - \bar{y})^2 \qquad (9)$$




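In terms of SSR and SST, R2 can be written as

$$R^2 = \frac{SST - SSR}{SST} \qquad (10)$$

The above equation indicates that, for a given SST, R2 is directly proportional to the difference between the sum of squared variations in y w.r.t. the mean (SST) and the sum of squared variations in y w.r.t. the estimated value (SSR).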
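
The quantities SSR, SST and R2 described above can be computed as in the following sketch for four observations. The function name `r_squared`, the observed values and the fitted values are made up for illustration and are not the paper's data.

```python
def r_squared(ys, y_hats):
    """Return (SSR, SST, R2) using the definitions in this section."""
    y_bar = sum(ys) / len(ys)
    # SSR: squared distances between observed and estimated values.
    ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    # SST: squared distances between observed values and their mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return ssr, sst, 1.0 - ssr / sst


# Four made-up observations and fitted values, for illustration only.
ys = [2.0, 4.0, 5.0, 8.0]
y_hats = [1.9, 3.8, 5.7, 7.6]
ssr, sst, r2 = r_squared(ys, y_hats)
print(f"SSR = {ssr:.3f}, SST = {sst:.3f}, R2 = {r2:.3f}")
```

Two limiting cases help with interpretation: if every estimated value equals the observed value, SSR is 0 and R2 is 1; if every estimated value equals the mean of y, SSR equals SST and R2 is 0.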


Not good fit:


A smaller R2 value indicates that SSR is large and close to SST, i.e. the variation in y w.r.t. the estimated value is large and close to the variation in y w.r.t. the mean, which is not a good fit.


Good fit:


A large R2 value indicates that SSR is very small (actual values of y are close to the estimated values of y) and much smaller than SST, i.e. the variation in y w.r.t. the estimated value is much smaller than the variation in y w.r.t. the mean, which is a good fit.


Results & Model Diagnostics


  1. Estimated Model

Using Python as the tool, the parameters of the regression model are calculated as shown below.


Using the 80% training data set:


  • Constant = 6.101
  • Regression coefficient = 0.160

The estimated model can be written as


$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \qquad (11)$$



Rev Grw(%) = 6.101 + 0.160 * ( Ad Grw(%) )
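
The estimated model can be turned into a small prediction helper, as sketched below. The constant and coefficient are the values reported above; the function name `predict_rev_growth` and the advertisement growth inputs are hypothetical and chosen only for illustration.

```python
BETA0_HAT = 6.101  # constant reported in the paper
BETA1_HAT = 0.160  # regression coefficient reported in the paper


def predict_rev_growth(ad_growth_pct):
    """Predicted revenue growth (%) for a given advertisement growth (%)."""
    return BETA0_HAT + BETA1_HAT * ad_growth_pct


# Hypothetical advertisement growth values, chosen only for illustration.
for ad in (0.0, 5.0, 10.0):
    print(f"Ad Grw = {ad:5.1f}%  ->  Rev Grw = {predict_rev_growth(ad):.3f}%")
```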


  1. Interpretation of Estimated Model

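The model estimates that each 1% of advertisement growth increases revenue growth by 0.160 percentage points. For example, if the sales revenue was 2 million in 2018, then according to our model a 1% advertisement growth contributes 0.160% of 2 million, i.e. 0.0032 million, so the estimated sales revenue for 2019 rises to 2.0032 million on account of advertisement growth, a rise of 3,200/- in revenue. This figure is the contribution of advertisement growth alone; the constant of 6.101% in the estimated model applies in addition.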
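
The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative; the figures are those of the example, and the calculation covers only the contribution of the regression coefficient, not the constant term.

```python
revenue_2018 = 2_000_000  # sales revenue in 2018 from the example
slope = 0.160             # revenue growth (%) per 1% of advertisement growth
ad_growth_pct = 1.0       # advertisement growth assumed in the example

extra_growth_pct = slope * ad_growth_pct
extra_revenue = revenue_2018 * extra_growth_pct / 100
print(f"Extra revenue from ad growth: {extra_revenue:,.0f}")
print(f"Revenue including this rise: {revenue_2018 + extra_revenue:,.0f}")
```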

  1. Model Diagnostics (Validation)

Before a regression model is used in practical applications, it should be validated & tested for goodness of fit. We use the coefficient of determination (R-squared) to determine goodness of fit. Using Python as a tool, the following value of R2 is calculated:



   R2 = 0.208


According to Cohen (1992) [9], R-squared values of 0.12 (12%) or below indicate low, values between 0.13 (13%) and 0.25 (25%) indicate medium, & values of 0.26 (26%) or above indicate high. Our model explains 20.8% of the variance in the validation set, which falls in the medium range, so it is a reasonably good fit.
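
The thresholds quoted above can be written as a small helper, sketched below. The function name `cohen_label` is hypothetical, and treating values that fall between the quoted bands (for example between 0.12 and 0.13) as belonging to the higher band is an assumption made here for simplicity.

```python
def cohen_label(r2):
    """Classify an R-squared value using the thresholds quoted in the text."""
    if r2 <= 0.12:
        return "low"
    if r2 <= 0.25:
        return "medium"
    return "high"


print(cohen_label(0.208))  # R2 reported for the validation set
```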


The simple linear regression model using the ordinary least squares (OLS) method shows a functional relationship between the outcome variable (sales revenue growth in %) and the feature (advertisement growth in %). The model is validated using R2 to ensure goodness of fit. An R-squared as low as 10% is generally accepted for studies in the arts, humanities and social sciences, because human behavior cannot be predicted accurately; a low R-squared is therefore often not a problem in these fields. Various other control parameters affect the value of R-squared. Therefore, to extend the scope of this research, social science characteristics such as age, gender, motivation towards the product and festive season should be included as control variables in the analysis.




[1] Core Python Programming by Dr. R. Nageswara Rao, second edition, Dreamtech Press.


[2] Machine Learning Using Python by Manaranjan Pradhan & U Dinesh Kumar, first reprint edition, Wiley publications.

[3] Sales prediction types available online at URL:

[4] Advantages of Linear regression model available online at URL:

[5] Will Koehrsen, article howtosetupyour-machinelearningproblem, can be found at the following URL:


[6] Data source of Ratios & Budgets can be found at following URL:




[9] Cohen's Conventions for Small, Medium, and Large R2 values can be found at the following URL:


[10] Small is beautiful. The use and interpretation of R2 in social research by Ferenc Moksony, pages 6 & 7.