*[This post was written in collaboration with Christian Bauckhage and Rafet Sifa.]*

In this – somewhat wordy – post, we introduce the foundations of cluster analysis, and discuss its application in finding patterns in the behaviors of players, and developing profiles of how people play your game.

Recent years have seen a deluge of behavioral data from players hitting the game industry. The reasons for the data surge are many, including the introduction of new business models, technological innovations, the popularity of online games and the increasing persistence of games.

Regardless of the causes, the proliferation of behavioral data leads to the problem of how to derive and implement insights from them. Behavioral datasets can be very big, time-dependent/-sensitive and high-dimensional.

**Clustering** – or cluster analysis – offers a way to explore such datasets and discover patterns in them that can reduce the overall complexity of the data. Clustering and other techniques for profiling players and play behavior have therefore become popular in the budding field of game analytics.

To take a quick example, let’s say you are dealing with a team-based FPS, and want to investigate how the game is being played, focusing on various gameplay metrics and retention. You collect metrics on kill/death ratios, time spent in particular modes (on foot, driving, being a passenger etc.), various other event data, and churn/retention data. Running a cluster analysis, you find several different clusters, but note that the majority of the players who churn out of the game rapidly belong to one of the clusters. Investigating this cluster in more detail, you find that these are players who spend all of their time driving vehicles, and very little time on foot. They also only play the few maps that are vehicle heavy, and quickly leave after having tried these maps a few times. It would be reasonable to suspect that this group of players love vehicle based combat, but that the low availability of vehicles in the game means they quickly churn out. The solution could be to try to add more maps with vehicles, or more things to do with the vehicles. More detailed analysis can also be performed trying to uncover further details, and some of the players in question may be even contacted and asked.

Different solutions could be envisioned in this case and they can be tried out using split-level testing. As with most other forms of user research, we find a problem and some information about why the problem occurs, but in order to verify it, we need to talk to the users. However, knowing that a problem exists can be enough to devise solutions and test them.

Clustering is imminently useful for categorizing your players and for getting an overall idea about the variance in player behavior and how behavior is organized, but also for detailed analysis. To present everything cluster analysis can do is vastly out of scope for this post, but suffice to say that clustering is used in every data context we have ever heard about, and that thousands of papers and books have been written on this foundational analysis. With that in mind, it should be readily apparent that the approach is useful in games, as well!

Clustering techniques however require expertise to use, and an understanding of the games in question is essential to evaluating the results of clustering.

This post is the first in a series of four that will cover the fundamentals of cluster analysis for player behavior, and the background for why these kinds of techniques are useful for game analytics. Here is an overview of the whole series:

- In the first, we introduce the foundations of cluster analysis. We introduce the basic theory behind clustering and its application in game analytics. Because we want to get around the basics, this is going to be a long post (a solid lunch break), but hopefully worth the read. We are, in part, building this on different introductory sources, e.g. Han and Kamber’s introduction to data mining.
- In the second post, we describe several classes of algorithms and discuss what they respectively are good and bad at, and provide references for finding out more about these.
- In the third post, we describe common pitfalls and underlying assumptions of working with cluster analysis in games.
- In the fourth post, we describe specific examples of applying cluster analysis on behavioral data from games. Until then, here is a paper that describes examples from Battlefield and Tera; and here we look at a previous Tomb Raider game. An example of cluster analysis employed as a function of time can be found here.

For an introduction to the foundations of game analytics in general, there is a list of some useful books here. For a good introduction to statistics in general, this book is recommended.

This series is aimed at people who understand basic statistics or have some experience with the fundamentals of game analytics, or those who are just looking for inspiration on new ways to derive actionable insights from behavioral data.

## Reducing dimensionality, finding patterns

Cluster analysis is fundamentally a dimensionality reduction technique. What this means is that it can take datasets that have many dimensions (e.g. behavioral variables/features), and locate the dimensions that matter the most, i.e. the behavioral variables along which the players are aligned.

This kind of reduction of the overall dimensionality of the search space can be performed using a variety of techniques, including descriptive methods such as feature selection combined with segmentation and unsupervised/supervised learning techniques. Collectively, these are valuable for finding patterns in behavioral game data, and developing groupings or profiles that provide insights for the game development process.

One way to deal with situations where high-dimensional game behavioral data need to be analyzed is clustering. This because clustering as an unsupervised method permits exploration of the data space, finding groups of players exhibiting similar behaviors, and identifying the behavioral features that inform them.

Originally developed for Anthropology and Psychology, cluster analysis is today widely applied across a number of fields and is a common approach in data mining and statistical analysis.

Cluster analysis has been readily adopted in game analytics as a method for finding patterns in player/customer behavior, but also to compare and benchmark games, evaluate performance of infrastructure and for designing and training artificial agents and game AI.

However, explorative unsupervised methods, such as clustering, require expertise to be used correctly, combined with a deep understanding of the context they are being employed in.

Today, it is possible to run a variety of cluster algorithms using statistics and analytics tools, without knowledge of the underlying process. However, without a clear understanding of how the algorithms work and the implied assumptions, the conclusions derived from such analysis risk being wrong.

Furthermore, cluster analysis is an active field in its own right, and there are a number of unresolved research issues, which means that validation of cluster structures can be challenging and require expertise.

Additionally, without knowledge of the game under examination, its mechanics, reward structures and monetization mechanisms, as well as the business around the game, making informed choices throughout the steps of a clustering analysis – from feature selection, pre-processing of data to visualization and explanation of results – runs the very real risk of leading to useless or dangerously misleading results.

## Foundations of cluster analysis

Cluster analysis fundamentally refers to the process of grouping sets of objects in such a way that objects assigned to the same group, called a cluster, are more similar (in some sense or another) to each other than to those in other clusters. In the current context, the objects are players or artificial agents, each described via a finite set of features. For example how often a player dies, their hit/miss ratio with weapons, character level, movement strategy, etc. Feature sets can be complex, e.g. include different data modalities.

The term cluster analysis refers to a process, and the actual algorithms used to develop clusters vary substantially. There are a number of algorithms, each with different strengths and weaknesses, and with different traditional application areas in different fields. Therefore, the definition of what a cluster is also varies between algorithms, and while all somehow refer to a group of objects in a data space, understanding the different cluster models is vital to the correct application of the different algorithms available.

Objects are generally represented as points (vectors) in the multi-dimensional space defined by the data. Each dimension is one feature (or attribute, variable). Conceptually, data for clustering are thus represented as an m by n data matrix, where there are m rows (one per object) and n columns (one per feature). The alternative to the data matrix is a proximity (or distance) matrix, which is an m by m matrix containing the pairwise similarities (dissimilarities). More complex data structures than feature vectors can also be used, e.g. graphs or text strings. One of the first operations typically necessary in a game context is standardization and transformation of the features. This both to normalize data, bring them into the same scale, or to reduce the number of dimensions.

Importantly, across all algorithms, the assigning of objects to clusters can be either “hard” or “soft”. Hard assignment means that an object is placed with a cluster completely (e.g. k-means), whereas soft assignment means that the model provides information about the degrees to which an object belongs to different clusters. More detailed divisions of the school of clustering algorithms (over 100 are known, and many variants of individual algorithms exist) is possible, see here for an overview.

The exact process of assigning objects to clusters depends on the clustering algorithm chosen. For example, the Euclidian distance between the objects in the variance space. Other approaches focus on finding areas in the data space with dense collections of objects, or uses specific statistical distributions.

Clustering is, importantly, not an automatic process, but rather an iterative process of knowledge discovery that requires choice and comparison of algorithms, defining and optimizing values such as density thresholds, the number of clusters, etc. Modifying the parameters of the model generated by a cluster analysis should be done with care because the same dataset can lead to different outcomes depending on how the parameters are set. This also means that there is no “correct” algorithm, but rather a question of iteratively trying out different algorithms – that are designed to work on the kind of model in question – until a good fit is found. Clustering remains a highly user-dependent process, with the first task being identifying the type of model needed, and then iteratively working with algorithms and parameter tweaking.

*In the next post, we describe several classes of algorithms and discuss what they are good and bad at, and provide references for finding out more about these. *

**Acknowledgements**

We are indebted to several colleagues for sharing their insights and feedback on this post, including but not limited to Christian Thurau, Fabian Hadiji and Shawn Connor.