Practical Issues In Game Data Mining

Several practical issues need attention when planning or carrying out the collection of game telemetry and the mining of that data. Confidentiality of user data and effective pre-processing are among the most important.

Transparency

The patterns discovered by data mining tools are useful only if they are interesting and understandable to the users they are aimed at. Any data mining result (model) should be as transparent as possible: it should describe a pattern that is intuitively interpretable, and it should be accompanied by an explanation targeted at the specific stakeholder or user of the result. Decision trees, for example, are intuitive and almost self-explanatory in terms of their results, whereas neural networks are comparatively opaque to the non-expert (as are non-linear models in general).

For example, a game designer may not be a statistics expert, so presenting the results of a variance analysis in the standard statistical reporting form (a series of values) will not help the designer understand the result or act upon it. Transparency is vital to ensure that the various users of game data mining results are able to understand and act upon them. Visualizations raise further issues: screen real estate, information rendering and user-pattern interaction. Being able to interact with raw data or mining results is important, because it gives users the means to focus and refine the mining tasks. It also allows users to examine the discovered knowledge from different angles or conceptual levels.

Data Cleaning

Data analysis can only be as good as the data being analysed, and most algorithms assume the data to be noise-free. That assumption matters because, depending on the technical back end, game telemetry data may be more or less complete, or saddled with different types of problems. Data cleaning (or cleansing) is the process of detecting and removing inconsistencies from data in order to improve and ensure its quality.

Quality problems in raw data come in many forms, e.g. misspellings during data entry, missing information or the presence of invalid data. When multiple sources of data are integrated, for example in a data warehouse, or when analysis is run across multiple data sources (e.g. telemetry from different games), the need for careful data cleaning increases, because combining datasets introduces further potential for error.

Performing data mining on low-quality data (“dirty data”), with, for example, missing or duplicate information, can compromise the validity and accuracy of the results or, even worse, lead to outright wrong results, following the “garbage in, garbage out” principle in data mining. As a consequence, data cleaning and data transformation (commonly referred to as pre-processing) are vital, but are often erroneously viewed as lost time. As frustrating as data cleaning may be, it is one of the most important phases of the knowledge discovery process. It is also a complex topic, and unfortunately it is not possible to provide simple guidelines for it. Despite its importance, there is a general lack of research in the area.
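No simple recipe covers every case, but the three problems named above (missing, invalid and duplicate information) can at least be illustrated. The following is a minimal Python sketch, not a general cleaning procedure. The record layout, the field names (player_id, level, completion_time) and the validity rules are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical raw telemetry: one dict per completed level.
raw_records = [
    {"player_id": "p1", "level": 3, "completion_time": 182.5},
    {"player_id": "p1", "level": 3, "completion_time": 182.5},  # exact duplicate
    {"player_id": "p2", "level": 1, "completion_time": None},   # missing value
    {"player_id": "p3", "level": 2, "completion_time": -40.0},  # invalid: negative time
    {"player_id": "p4", "level": 2, "completion_time": 95.0},
]

REQUIRED_FIELDS = ("player_id", "level", "completion_time")


def clean(records):
    """Return (clean_records, report); report counts why records were dropped."""
    report = Counter()
    seen = set()
    cleaned = []
    for record in records:
        # Missing information: a required field is absent or None.
        if any(record.get(field) is None for field in REQUIRED_FIELDS):
            report["missing"] += 1
            continue
        # Invalid data: these rules are assumptions chosen for illustration.
        if record["completion_time"] < 0 or record["level"] < 1:
            report["invalid"] += 1
            continue
        # Duplicate information: same values in every required field.
        key = tuple(record[field] for field in REQUIRED_FIELDS)
        if key in seen:
            report["duplicate"] += 1
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned, report


cleaned_records, drop_report = clean(raw_records)
print(len(cleaned_records), dict(drop_report))
```

Keeping a count of why records were dropped, rather than discarding them silently, makes it easier to spot a problem in the telemetry back end itself.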

Performance and Sampling

Many methods for data analysis and interpretation were not originally designed for the very large datasets that exist today. In game development, telemetry datasets easily reach terabyte size for online social games or for large commercial games with hundreds of thousands or millions of players. In addition to the size of the data, its dimensionality, i.e. the number of variables known for each player in a game (e.g. completion time, class, level), is decisive for the choice of data mining techniques. In general, the search space grows exponentially with the number of dimensions in a dataset, and the effect is so dramatic that it is currently one of the most important research problems in data mining.
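To make the exponential growth concrete: if every variable is split into k bins, d variables give k^d possible combinations of values. The short Python sketch below prints that number for a few values of d. The figures used (one million players, ten bins per variable) and the function names are assumptions chosen purely for illustration.

```python
def grid_cells(bins_per_variable: int, num_variables: int) -> int:
    """Number of distinct value combinations when every variable is split into bins."""
    return bins_per_variable ** num_variables


def max_occupied_fraction(num_players: int, bins_per_variable: int, num_variables: int) -> float:
    """Upper bound on the share of combinations that can contain at least one player."""
    return min(1.0, num_players / grid_cells(bins_per_variable, num_variables))


# Assumed figures for illustration: one million players, ten bins per variable.
for num_variables in (2, 5, 10, 20):
    cells = grid_cells(10, num_variables)
    share = max_occupied_fraction(1_000_000, 10, num_variables)
    print(f"{num_variables:>2} variables: {cells:.0e} cells, at most {share:.4%} occupied")
```

With ten variables there are already 10^10 combinations, so even a million players can cover at most 0.01% of them. The data becomes sparse very quickly as variables are added.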

Many techniques have issues with scalability and efficiency at large scales and dimensions, especially those whose cost grows quadratically (or as a higher-order polynomial) with dataset size, and those with exponential complexity. Sampling is a possible solution, i.e. mining part of the dataset rather than the whole, and extrapolating results from the sample to the whole dataset. Sampling has its own complexities and challenges, for example ensuring that the sample is representative and captures the features of the entire dataset. Another approach is parallel programming, where the dataset is subdivided and the results for each subset are merged later.
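Both ideas can be sketched in a few lines of Python. The example below is only an illustration on synthetic data: it draws a stratified sample (one common way to keep small groups represented, not the only one), and it computes a mean by subdividing the dataset and merging per-chunk summaries. The player classes, the completion times and all function names are invented for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction from every stratum so small groups stay represented."""
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))  # at least one record per stratum
        sample.extend(rng.sample(group, size))
    return sample


def partial_stats(chunk):
    """Per-chunk summary that can be merged later: (sum, count)."""
    times = [record["completion_time"] for record in chunk]
    return sum(times), len(times)


def merged_mean(chunks):
    """Combine per-chunk summaries into the mean over the whole dataset."""
    total, count = 0.0, 0
    for chunk in chunks:
        chunk_sum, chunk_count = partial_stats(chunk)
        total += chunk_sum
        count += chunk_count
    return total / count


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Synthetic data, invented for illustration: 9,000 warriors and 1,000 mages.
players = (
    [{"class": "warrior", "completion_time": rng.gauss(300, 30)} for _ in range(9000)]
    + [{"class": "mage", "completion_time": rng.gauss(420, 30)} for _ in range(1000)]
)

sample = stratified_sample(players, key="class", fraction=0.01, rng=rng)
sample_mean = sum(r["completion_time"] for r in sample) / len(sample)

# Subdivide into 10 chunks; each chunk could be handled by a separate worker.
chunks = [players[i::10] for i in range(10)]
full_mean = merged_mean(chunks)
print(len(sample), round(sample_mean, 1), round(full_mean, 1))
```

The chunks are processed in a plain loop here. In practice each call to partial_stats could run on a separate process or machine, since only the small (sum, count) pairs need to be brought together. Note that this only works directly for statistics that can be merged from partial results.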

Security

Security is an important issue with any game telemetry data collection, whether the data is intended for low-level work or for strategic decision making. Game telemetry data are generally considered confidential in the industry and should be kept safe, which includes deciding how to handle data access, the transfer of data and the transfer of results.

Social and Privacy Issues

One of the key issues in data mining is the question of individual privacy. The immense collections of data on people, and the many opportunities for collecting additional information, combined with data mining, make it possible to analyse, for example, routine business transactions and obtain a substantial amount of information about the habits and preferences of individuals or businesses. Additionally, when data is collected for player profiling, for analysing behaviour, for correlating personal data with other information, and so forth, sensitive and private information about individuals or businesses is collected and stored. This is controversial given the confidential nature of such data and the potential for illegal access to it. Another issue is how the data is used. Because this type of data is valuable, databases of all kinds are traded. It is therefore important to be aware of which data and analysis results are being distributed, e.g. email addresses of players.
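One practical way to stay aware of what leaves the building is to pass every record through an explicit export step before it is shared. The Python sketch below shows one possible shape for such a step, under stated assumptions: the list of identifying fields, the record layout and the function names are hypothetical, and the key would have to be stored securely elsewhere. Replacing IDs with a keyed hash is pseudonymisation only, and behavioural data may still allow individuals to be re-identified.

```python
import hashlib
import hmac

# Assumption: fields treated as direct identifiers; adjust to the actual schema.
DIRECT_IDENTIFIERS = {"email", "real_name", "ip_address"}


def pseudonymise(player_id: str, secret_key: bytes) -> str:
    """Map a player ID to a stable keyed hash that cannot be recomputed without the key."""
    return hmac.new(secret_key, player_id.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_sharing(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace the player ID before a record is distributed."""
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    shared["player_id"] = pseudonymise(record["player_id"], secret_key)
    return shared


# Hypothetical record and key, for illustration only.
example_key = b"replace-with-a-securely-stored-key"
record = {"player_id": "p42", "email": "player@example.com", "level": 7, "playtime_hours": 12.5}
shared_record = prepare_for_sharing(record, example_key)
print(sorted(shared_record))
```

Because the same ID always maps to the same hash under a given key, analyses that follow a player across sessions still work on the shared data.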