Tensorial Principal Component Analysis in Detecting Temporal Trajectories of Purchase Patterns in Loyalty Card Data: Retrospective Cohort Study

Background Loyalty card data automatically collected by retailers provide an excellent source for evaluating health-related purchase behavior of customers. The data comprise information on every grocery purchase, including expenditures on product groups and the time of purchase for each customer. Such data where customers have an expenditure value for every product group for each time can be formulated as 3D tensorial data.
Objective This study aimed to use the modern tensorial principal component analysis (PCA) method to uncover the characteristics of health-related purchase patterns from loyalty card data. Another aim was to identify card holders with distinct purchase patterns. We also considered the interpretation, advantages, and challenges of tensorial PCA compared with standard PCA.
Methods Loyalty card program members from the largest retailer in Finland were invited to participate in this study. Our LoCard data consist of the purchases of 7251 card holders who consented to the use of their data from the year 2016. The purchases were reclassified into 55 product groups and aggregated across 52 weeks. The data were then analyzed using tensorial PCA, allowing us to effectively reduce the time and product group-wise dimensions simultaneously. The augmentation method was used for selecting the suitable number of principal components for the analysis.
Results Using tensorial PCA, we were able to systematically search for typical food purchasing patterns across time and product groups as well as detect different purchasing behaviors across groups of card holders. For example, we identified customers who purchased large amounts of meat products and separated them further into groups based on time profiles, that is, customers whose purchases of meat remained stable, increased, or decreased throughout the year or varied between seasons of the year.
Conclusions Using tensorial PCA, we can effectively examine customers’ purchasing behavior in more detail than with traditional methods because it can handle time and product group dimensions simultaneously. When interpreting the results, both time and product dimensions must be considered. In further analyses, these time and product groups can be directly associated with additional consumer characteristics such as socioeconomic and demographic predictors of dietary patterns. In addition, they can be linked to external factors that impact grocery purchases such as inflation and unexpected pandemics. This enables us to identify what types of people have specific purchasing patterns, which can help in the development of ways in which consumers can be steered toward making healthier food choices.


Interpretation of the Tensorial PCA Components and Directions
We next give examples of interpreting the tensorial PCA components and directions. Assume for simplicity that we have a total of p = 3 different items and t = 4 time points and that we have chosen to retain p0 = 2 principal item directions and t0 = 2 principal time directions (attaining a dimension reduction from pt = 12 elements to p0t0 = 4 elements). Let the directions be as follows:

$$U = \begin{pmatrix} 1 & 0 \\ 0 & \tfrac{1}{\sqrt{2}} \\ 0 & -\tfrac{1}{\sqrt{2}} \end{pmatrix}, \qquad V = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & -1 \\ 1 & -1 \end{pmatrix}.$$

The principal components $Z = U^\top Y V$ (where we have dropped the index i for convenience) then take the form

$$Z = \begin{pmatrix} \tfrac{1}{2}\, x_{1,1:4} & \tfrac{1}{2}\,(x_{1,1:2} - x_{1,3:4}) \\ \tfrac{1}{2\sqrt{2}}\,(x_{2,1:4} - x_{3,1:4}) & \tfrac{1}{2\sqrt{2}}\,([x_{2,1:2} + x_{3,3:4}] - [x_{2,3:4} + x_{3,1:2}]) \end{pmatrix},$$

where $x_{k,s:t}$ denotes the sum of the elements on the kth row of $Y$ over the columns from s to t, that is, the total (centered) purchase amount of the product k over the time points from s to t. The interpretations of the four components in $Z$ are as follows:
- The (1,1) principal component measures (and gets large values for people who have a large) overall purchase amount of item 1. This is evident in the direction matrices from $u_1$, the first column of $U$, being dominated by the first item and $v_1$, the first column of $V$, being spread evenly over all four time points. This kind of pattern can emerge if the data contain an item, e.g., bread, that is typically purchased by all customers throughout the year.
- The (1,2) principal component measures the contrast in the purchase amounts of item 1 between the first two and the last two time points. This can be seen from the direction $v_2$ having elements of equal magnitude but opposite signs. This kind of pattern can emerge if the data contain a product, e.g., (non-frozen) berries, that is purchased less than average during a certain period of time and more than average during another period of time.
- The (2,1) principal component measures the contrast between the overall purchase amounts of items 2 and 3, i.e., someone with a large value for this component has purchased greater than average amounts of item 2 and less than average amounts of item 3. This kind of pattern can emerge if the data contain two items, e.g., vegetarian products and meat, that are typically not purchased together by the same customer.
- The (2,2) principal component is a combination of the previous two contrasts, i.e., someone with a large value for this component has purchased greater than average amounts of item 2 during the first two time points and of item 3 during the last two time points, and vice versa. This kind of pattern can emerge if the data contain two products, e.g., ice cream and warm beverages, which are associated with two opposing time periods.
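
The four interpretations above can be checked numerically. The following is a minimal sketch, assuming orthonormal directions consistent with the description (item direction 1 loading only on item 1, item direction 2 contrasting items 2 and 3, time direction 1 spread evenly, time direction 2 contrasting the first and last two time points). The centered purchase matrix `Y` and the helper `row_sum` are hypothetical and serve only as an illustration.

```python
import numpy as np

# Assumed directions for the toy example (p = 3 items, t = 4 time points).
U = np.array([[1.0, 0.0],
              [0.0, 1.0 / np.sqrt(2)],
              [0.0, -1.0 / np.sqrt(2)]])
V = 0.5 * np.array([[1.0, 1.0],
                    [1.0, 1.0],
                    [1.0, -1.0],
                    [1.0, -1.0]])

# Hypothetical centered purchases of one customer
# (rows: items, columns: time points).
Y = np.array([[2.0, 1.0, -1.0, -2.0],
              [1.0, 1.0, 1.0, 1.0],
              [-1.0, -1.0, -1.0, -1.0]])


def row_sum(Y, k, s, t):
    """Sum of row k of Y over columns s..t (1-based, inclusive)."""
    return Y[k - 1, s - 1:t].sum()


# Principal components as a two-sided projection of the data matrix.
Z = U.T @ Y @ V

# The same components written as sums and contrasts of row totals.
c = 1.0 / (2.0 * np.sqrt(2.0))
Z_closed = np.array([
    [0.5 * row_sum(Y, 1, 1, 4),
     0.5 * (row_sum(Y, 1, 1, 2) - row_sum(Y, 1, 3, 4))],
    [c * (row_sum(Y, 2, 1, 4) - row_sum(Y, 3, 1, 4)),
     c * ((row_sum(Y, 2, 1, 2) + row_sum(Y, 3, 3, 4))
          - (row_sum(Y, 2, 3, 4) + row_sum(Y, 3, 1, 2)))],
])

print(np.round(Z, 3))
```

For this hypothetical customer, the (1,2) component is large because item 1 is bought mostly early in the period, and the (2,1) component is large because item 2 is bought more than average and item 3 less than average throughout.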

[Figure S1 panels: (A) count of male and female card holders and (B) age, each compared between the analysis sample and the excluded card holders.]

Correlation Structures of Products and Time
Our first step in the tensorial approach was to compute the modal covariance matrices for both modes: product groups (55 × 55 matrix) and weeks (52 × 52 matrix). Based on these, we obtained the modal correlation matrices

$$R_k = \operatorname{diag}(S_k)^{-1/2}\, S_k\, \operatorname{diag}(S_k)^{-1/2}, \qquad k = 1, 2,$$

where $S_k$ is the covariance matrix of mode k and the diag function takes only the diagonal part of its input matrix. The elements of $R_1$ and $R_2$ can be interpreted as correlations between the product groups and correlations between the weeks, respectively. The two matrices are visualized in Supplementary Figure 2. Panel A shows that the level of expenditure (higher or lower) tended to be fairly stable over the year, as indicated by the positive correlations across all weeks. Adjacent weeks in particular were strongly correlated, which is a sign of serial correlation (Supplementary Figure 2A). At the same time, the holiday weeks 12, 25, and 51 stand out because of their understandably different purchase behaviour; purchases in these weeks may include different products and amounts compared with everyday life. Similarly, the correlation analysis of expenditure on product groups showed a clear correlation between the expenditures on different product groups such as fruits, vegetables, cheese, milk and cream, sweets, chocolate, ready-to-eat foods, snacks, and soft drinks. The exceptions are cigarettes, beer, and wine and cider, which show no correlation or even a negative correlation with the other product groups. Instead, they correlated only with each other and slightly with snacks and soft drinks (Supplementary Figure 2B).
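
The computation described above can be sketched as follows. This is only an illustration on synthetic data: the array `X`, its small dimensions, and the normalizing constants 1/(nt) and 1/(np) are assumptions, not taken from the study, and `cov_to_corr` is a hypothetical helper name. The rescaling by the inverse square roots of the diagonal is the standard conversion from a covariance matrix to a correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data tensor: n customers, p product groups,
# t weeks (the study itself has 7251 x 55 x 52).
n, p, t = 200, 5, 6
X = rng.gamma(shape=2.0, scale=1.0, size=(n, p, t))

# Center each (product, week) cell over customers.
Xc = X - X.mean(axis=0)

# Modal covariance matrices (normalization is an assumption):
# product mode: average of Xc_i Xc_i^T, week mode: average of Xc_i^T Xc_i.
S_prod = np.einsum("ipt,iqt->pq", Xc, Xc) / (n * t)
S_week = np.einsum("ipt,ipu->tu", Xc, Xc) / (n * p)


def cov_to_corr(S):
    """Rescale a covariance matrix by the inverse square roots of its diagonal."""
    d = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(d, d)


R_prod = cov_to_corr(S_prod)   # correlations between product groups
R_week = cov_to_corr(S_week)   # correlations between weeks

print(R_prod.shape, R_week.shape)
```

Heatmaps of `R_week` (ordered by time) and `R_prod` (ordered by a hierarchical clustering) would correspond to the kind of display described for the correlation figure.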

Detecting Atypicalities
Tensorial PCA allows outlier detection to be conducted simultaneously along several dimensions [23,24,30]. Here, we used it to identify atypicalities within the data in both the time and product dimensions.
The loadings on weeks, $V \in \mathbb{R}^{52 \times t_0}$, revealed patterns indicating changes along the seasons of the year, and weeks next to each other tended to have similar loadings (Supplementary Figure 3A).
As is often observed in standard PCA as well, the first component (i.e., the first column of the loading matrix) represents the "average" over the year, i.e., the loadings for all weeks are roughly equal [29]. The second and the eighth components each seemed to load highly on a single week, Christmas and Easter, respectively, and these weeks could be interpreted as outlying times. Midsummer was highly loaded in components five and seven, but not alone. This might indicate purchase behaviour that is in some respects similar to that at Christmas ("shared purchasing behaviour"; component 5) and in other respects opposite to it ("differential purchasing behavior"; component 7). The nature of these similarities or differences in purchasing behaviour cannot be explored using the time dimension alone. The purchases during holiday seasons were clearly distinguished from the remaining weeks; the only weeks with an absolute loading higher than 0.5 are the weeks of Christmas (week 51), Easter (week 12), and Midsummer (week 25). Not surprisingly, tensorial PCA revealed that the purchase behavior within the holiday seasons is clearly different from that in other periods of the year, and these periods dominated the loadings of several components.
Simultaneously with estimating the time loadings, tensorial PCA also estimates the loadings $U \in \mathbb{R}^{55 \times p_0}$ for product groups, illustrated in Supplementary Figure 3B. The highest loadings revealed the product groups whose purchases differ most among customers and over time. Many of the products have very low loadings. As the first principal components contain the majority of the variation, we focused on them. Among these, beer and cigarettes clearly stood out with the highest loadings for the first two components. In the first component, beer and cigarettes loaded in the same direction, which indicates similarity in their purchase patterns. The second component also loaded highly on beer and cigarettes, but now in opposite directions. This indicates the presence of purchasing patterns where one of them is bought (e.g., beer) while the other is absent (e.g., cigarettes). Additionally, wine and cider have the single highest loading in component nine. While all of these findings provide interesting insights into purchasing behaviours, these particular products (beer, cigarettes, wine, and cider) provide limited understanding of food purchase patterns as a whole, and thus, we regard them as outlying products for our purposes.
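
The idea of reading atypical weeks off the loadings can be sketched on synthetic data. The source does not spell out the estimator here, so the sketch assumes a common formulation in which the product and week loadings are the leading eigenvectors of the respective modal covariance matrices. The data, the injected "holiday" effect in week index 5, the helper `top_eigvecs`, and the choices `p0 = 2`, `t0 = 3` are all hypothetical; only the 0.5 threshold on absolute loadings follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: n customers, p product groups, t weeks.
n, p, t = 300, 4, 8
X = rng.normal(size=(n, p, t))
# Hypothetical holiday effect: a customer-specific shift in week index 5
# that is shared by all product groups.
X[:, :, 5] += 4.0 * rng.normal(size=(n, 1))

Xc = X - X.mean(axis=0)
S_prod = np.einsum("ipt,iqt->pq", Xc, Xc) / (n * t)
S_week = np.einsum("ipt,ipu->tu", Xc, Xc) / (n * p)


def top_eigvecs(S, k):
    """Return the k eigenvectors of symmetric S with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]


p0, t0 = 2, 3                               # retained directions (assumed)
U = top_eigvecs(S_prod, p0)                 # product loadings, p x p0
V = top_eigvecs(S_week, t0)                 # week loadings, t x t0

# Scores: one p0 x t0 matrix per customer.
Z = np.einsum("pa,ipt,tb->iab", U, Xc, V)

# Weeks whose absolute loading exceeds 0.5, per time component.
flagged = {j: np.flatnonzero(np.abs(V[:, j]) > 0.5).tolist()
           for j in range(t0)}
print(flagged[0])
```

In this synthetic setting the first time component is dominated by the injected week, which mirrors how a single highly loaded week can be read as an outlying time. The same thresholding applied to the columns of `U` would flag outlying product groups.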


Figure S2. Correlation matrices of shopping data on weeks (A) and expenditures (B).
In A, the correlations are arranged by time. Overall, there is a clear correlation in the shopping pattern throughout the weeks, and this correlation is even higher between weeks next to each other. Weeks 12, 25, and 51 are outliers with smaller correlations than the others. In B, the product groups are arranged by similarity by means of hierarchical cluster analysis, i.e., strongly correlating product groups are adjacent to each other. Beer, cigarettes, and wine and cider stand out in this analysis as they correlate with each other but not with other products. Color keys for the correlations are given for both subfigures.

Figure S1. (A) Comparison of the gender distribution of the card holders included in the study and those excluded. Of those included, 66.7% were female, compared with 68% of those excluded. (B) The mean age of the card holders included in the study was 47.0 years (SD 14.6), compared with 45.3 years (SD 14.7) among those excluded.

Figure S3. The first several columns of the loading matrices for weeks (A) and for product groups (B) illustrated using heatmaps. In Figure A, weeks 51, 25, and 12 clearly stand out. In Figure B, most of the product loadings are very small, but some are clearly specific to a certain principal component. For example, cigarettes and beer have the highest absolute loadings for the first two PCs, and wine and cider for the ninth PC. Color keys for the loadings are given above both subfigures.

Figure S5. The next three pages give eighteen subfigures illustrating the various purchasing behaviours of individuals. Each figure shows the average money spent on each product in each week for the groups of people whose shopping behaviour, i.e., their PC score for the specified week and product pair, is within the highest 10% (left) or the lowest 10% (right). Only the products with an absolute loading higher than 0.3 for at least one of the product PC components are illustrated. Color keys for the loadings are given above each subfigure.

Figure S7. Scree plots and cumulative variance explained plots of the standard PCA results. On the left, the data are arranged as weekly data, in which the money spent has been summed over all products; Figure A is the scree plot of these data, and Figure D illustrates the cumulative variance explained. Similarly, in the middle, Figures B and E illustrate the results for the product-wise data, and Figures C and F illustrate the results for the combined data. The dashed lines in each figure show the number of principal components that were detected as significant using the augmentation method.

Figure S8. Comparison between the principal component scores of individuals. In the tensorial analysis, the focus is on the first time component, while the product component varies between 1 and 3 across the figures (each product component corresponds to a single column of subplots). We aimed to determine whether the first three PCA scores of the products summed across the year revealed the same information as product PCs 1-3 of the tensorial analysis. The diagonal plot (A) illustrates a clear correlation between the standard PC (PCprod = 1) and the tensorial PC (PC1 - product average, PC1 - weekly average), as does plot (E) between the standard PC (PCprod = 2) and the tensorial PC (PC2 - ready-to-eat, PC1 - weekly average).

Table S1. Names of products and descriptives of the data. The median, minimum, and maximum across all customers give the expenditure in euros on each product per €1000 spent. The last column gives the percentage of the total money spent by all card users that went to the product.

Product | Median across all customers (€) | Min (€ of each 1000€) | Max (€ of each 1000€) | % of total money spent
Baby foods
a Canned foods include main dishes (e.g., meatballs, chicken soup, ratatouille, tomato soup) and pates (e.g., ham spread, salmon spread). Canned vegetables include products like pickled cucumbers, olives, and preserved tomatoes, among others.
b Pig and bovine meat include beef and pork as whole meat products and minced meat.
c Ready-to-eat food includes packaged and service counter sold ready meal portions (such as pasta Bolognese, salmon soup, and ham casserole), frozen meals (e.g., pizzas), and bakery products (e.g., meat pies, paninis).