Data comes in many forms, but a lot of the time people are focused with how that data is evolving over time. There’s been numerous projects I’ve done in the past that have had some kind of sales data that looked like:
Where we have a date/time value across all geographic locations, with multiple SKUs and some kind of sales unit at the end of the day. Let’s say we’re tasked with seeing how this data trends over time.
One quick approach might be to see how each SKU trends over time per country. In this case we’re focusing on one country (Canada) and looking at the trends. We see some bumps here and there, maybe some seasonal effects, but we want a cold, hard number to tell our sales team how the stuff is growing over time. We want a regression line. So we could slap a regression model on each of these three lines and extract the equations that govern them. The catch is that this is all well and good for one country, but we have 100 more to go. What’s a better approach instead of us just making one regression plot after another in Excel or something?
Can’t Spell Rescue Without “R”
My go-to approach in the past with this kind of hierarchical data problem is to solve it programmatically. R is well positioned to solve this problem through the use of data grouping.
library(dplyr) inputData <- read.table("clipboard", sep="\t", quote="", header=T, row.names=NULL) inputData$datenumeric <- as.numeric(inputData$month)
First what I'm doing is loading in the data, then changing the month data from a form of "2014M1" to "1" so its possible for us to actually do our regression modelling later. Next we do some groupings in the data to prep it for summarization:
subsetColumns <- c("datenumeric"
subsetdata <- inputData[subsetColumns]
grouped <- group_by(subsetdata
Next we actually do the modelling part:
mods % do(linear = lm(units ~ datenumeric, data = .))
summarise(units_Max = max(units)
,units_Min = min(units)
,units_Var = (units_Max – units_Min)
,units_Avg = mean(units)
,units_Count = n()
,units_Pearson = cor(datenumeric, LBI, method=”pearson”, use=’complete.obs’)
%>% left_join(mutate(mods, units_Slope = coef(linear)))
z$linear <- NULL
So at the end of the day we have some data that looks like this:
What’s great about this form is that we have a country and SKU, but we also have descriptive statistics over the time period, how close to a linear trend it is (Pearson) and the linear growth rate over that period. For posterity, we have the count of months so we can see if we have any missing data. This table of data allows us to see which were the fastest linearly growing combinations of country and SKU and just how linear they were.
There’s a lot of assumptions here, namely that growth is suspected to be a linear trend. I feel like this is an ok assumption to start with, since we might use this to see what the overall trend is over a long period of time. If we wanted to check for non-linear growth or decline, we might use this as a baseline for further investigation.