Overview of the problem
Sales forecasting is a long-standing problem for any retail company. For La Redoute, an estimate of the coming week’s sales allows for smoother management of product stock across its various warehouses, which in turn helps maintain product availability and shorten delivery times.
Due to a variety of internal and technical factors, we decided to model our sales prediction problem as a weekly time-series forecast two weeks into the future. This means that the final model takes as input the series of weekly sales over a specified period for each product and outputs a prediction for the sales of the current week and the following week.
Our problem ended up being defined as illustrated in Figure 1: a history of arbitrary length is given to the model as input, and a sequence of length 2 is returned as output, corresponding to the predictions for the current week (w0) and the following week (w1).
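The framing above can be sketched as a sliding window over one product’s weekly sales. This is only an illustration: the function name `make_windows`, the fixed `history` length and the sample numbers are assumptions made for the example, not part of our pipeline, which accepts histories of arbitrary length.

```python
from typing import List, Tuple


def make_windows(
    weekly_sales: List[float], history: int, horizon: int = 2
) -> List[Tuple[List[float], List[float]]]:
    """Split one product's weekly sales into (past, future) training pairs.

    `history` is the number of past weeks fed to the model (an assumption
    for this sketch); `horizon` is 2, i.e. the targets are (w0, w1).
    """
    samples = []
    # The last valid start must leave room for `history` inputs and `horizon` targets.
    for start in range(len(weekly_sales) - history - horizon + 1):
        past = weekly_sales[start:start + history]
        future = weekly_sales[start + history:start + history + horizon]
        samples.append((past, future))
    return samples


# Made-up weekly unit sales for a single product.
sales = [12, 15, 9, 20, 18, 22, 17]
pairs = make_windows(sales, history=4)
for past, future in pairs:
    print(past, "->", future)
```

Each pair maps a run of past weeks to the two weeks that follow it, which is the input/output shape described for Figure 1.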
Although they have existed since the late 1990s (Hochreiter and Schmidhuber, Long Short-Term Memory, 1997), LSTMs only gained prominence in the mid-2000s. Since then they have shown strong results in various competitions and have been integrated into products from companies such as Facebook, Google and Amazon. They are powerful Recurrent Neural Networks (RNNs) that can approximate continuous recursive functions while directly accessing representations of information obtained in previous iterations.
How LSTMs work
Figure 2: Schematic of the Gate Structure of an LSTM cell
As Figure 2 shows, an LSTM is used by recurrently accessing its structure (cell A in the image). This cell works in a similar way to a classical RNN, with the addition of three filter-like structures (called “gates”) that control the information kept in the long-term and short-term memories of the cell, as detailed in Figure 3. These “gates” are, in reality, small feedforward neural networks that learn, through exposure to samples from the input space, what sort of information should be kept or discarded in which situations.
Figure 3 identifies a cell state, corresponding to the cell’s long-term memory. This long-term memory is responsible for retaining relevant information so that it can be accessed many iterations in the future. When a critical iteration comes, one where information from the distant past is needed, the output gate determines which parts of that memory are used for the current output. The pointwise multiplication of the output gate’s filter with the information contained in the long-term memory (squashed by a tanh) produces both the output (read: prediction) of the LSTM for that iteration and the short-term memory of the network, which is accessed directly by the following iteration.
The next iteration receives the previous iteration’s long-term and short-term memories together with the new inputs from the next sample. The short-term memory and the new inputs are combined and passed to the forget gate which, based on the information it receives, learns what to remove from the long-term memory of the cell. Finally, the input gate learns what information to write to the long-term memory by filtering the combination of short-term memory and input after it has been passed through a tanh function.
All these elements combine to form a structure that can autonomously manage the information it receives, memorizing what it considers relevant and forgetting the rest, in order to capture complex, time-dependent patterns.
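To make the gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM iteration using the standard textbook equations. It is not our production model: the names `lstm_step`, `W` and `b`, the stacked weight layout, the hidden size and the random weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM iteration (standard formulation, illustrative layout).

    x:      new input, shape (n_in,)
    h_prev: short-term memory from the previous iteration, shape (n_hid,)
    c_prev: long-term memory (cell state), shape (n_hid,)
    W:      stacked gate weights, shape (4 * n_hid, n_hid + n_in)
    b:      stacked gate biases, shape (4 * n_hid,)
    """
    n_hid = h_prev.shape[0]
    # Short-term memory and new input are combined and fed to every gate.
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n_hid])                # forget gate: what to drop from c_prev
    i = sigmoid(z[n_hid:2 * n_hid])        # input gate: how much new info to write
    g = np.tanh(z[2 * n_hid:3 * n_hid])    # candidate values, squashed by tanh
    o = sigmoid(z[3 * n_hid:4 * n_hid])    # output gate: what to expose
    c = f * c_prev + i * g                 # updated long-term memory
    h = o * np.tanh(c)                     # output and new short-term memory
    return h, c


# Run a few made-up (scaled) weekly sales values through an untrained cell.
rng = np.random.default_rng(0)
n_in, n_hid = 1, 3
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for units in [0.12, 0.15, 0.09, 0.20]:
    h, c = lstm_step(np.array([units]), h, c, W, b)
print(h, c)
```

In a trained network `W` and `b` are learned from data; here they are random, so the values only show how information flows from one iteration to the next.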
Application to our use case
After settling on the LSTM as our algorithm of choice for this PoC, we made further modifications to give it new functionality that we considered important for our problem. One of those modifications was the shift to an encoder-decoder architecture with an attention layer added on top. Without going into too many details about this mechanism, which would be beyond the scope of this article, the attention layer (illustrated in Figure 4) allows the network to have one last look at all the predictions made for the previous weeks to help predict the week we’re most interested in, the following week (w1).
Figure 4: Encoder-Decoder architecture example for a chat-bot
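As a rough illustration of that “one last look”, the sketch below implements dot-product attention over the encoder’s per-week states. The scoring function, the name `attend` and the toy vectors are assumptions: this article does not specify which attention variant the PoC uses.

```python
import numpy as np


def attend(decoder_state, encoder_states):
    """Dot-product attention (one common variant, assumed here).

    decoder_state:  current decoder vector, shape (n_hid,)
    encoder_states: one vector per past week, shape (n_weeks, n_hid)
    Returns the context vector and the attention weight given to each week.
    """
    scores = encoder_states @ decoder_state          # similarity per week
    scores = scores - scores.max()                   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over weeks
    context = weights @ encoder_states               # weighted sum of states
    return context, weights


# Toy encoder states for three past weeks and a toy decoder state.
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([1.0, 0.0])
context, weights = attend(decoder_state, encoder_states)
print(weights, context)
```

The weights sum to one, so the context vector is a weighted average of the past weeks’ states that the decoder can combine with its own state when predicting w1.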
As is common in problems of this type, predicting that next week’s sales will be the same as the previous week’s makes for a very strong baseline model. Our PoC managed to beat this baseline, which gives us confidence that the data contains enough signal to make meaningful predictions.
Figure 5 : RMSE for products that sold more than 10 units each week. The difference is statistically significant under a paired t-test.
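For reference, the “same as last week” baseline and the RMSE metric can be sketched in a few lines. The helper names `naive_forecast` and `rmse` and the sales figures are made up for the example; they are not our evaluation code or data.

```python
import math
from typing import List, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def naive_forecast(weekly_sales: Sequence[float]) -> List[float]:
    """Predict every week (from the second one on) as a copy of the week before."""
    return list(weekly_sales[:-1])


# Made-up weekly unit sales for one product.
sales = [12, 15, 9, 20, 18]
actual = sales[1:]                 # weeks that can be predicted
baseline = naive_forecast(sales)   # previous week's value for each of them
baseline_rmse = rmse(actual, baseline)
print(round(baseline_rmse, 3))
```

A model’s predictions for the same weeks can be scored with the same `rmse` function. Per-product errors from the model and the baseline could then be compared with a paired test, for example `scipy.stats.ttest_rel`.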
Time-series analysis is a growing field that has shown a lot of potential in the past decade, so more research on the topic is to be expected. This will likely have a large impact on applied fields such as sales forecasting: as new techniques and algorithms are developed, we can expect companies to become better at anticipating drastic changes in their respective markets before they happen.
Figure 1 : Towards Data Science
Figure 2 : Understanding LSTMs
Figure 3 : LSTM Step by Step
Figure 4 : Medium – Attention Mechanism