Feature Selection and the Law of Diminishing Returns
Anders Drachen
Anders Drachen, Ph.D. is a veteran Data Scientist, Game Analytics consultant and Professor at the DC Labs, University of York (UK).
A problem that was mentioned repeatedly during the recent Data Science Day in Berlin is feature selection: given the array of possible variables/features to track from a digital game, which of these should we track? The solution I heard mentioned most often was: track everything, analyze everything. However, this approach is not without its problems, notably the resources required to analyze everything. Another thing to consider is the law of diminishing returns.
Limited resources
In a situation with infinite resources, it is possible to track, store and analyze every single user-initiated action – every fraction of a move of an avatar, every button press, all purchases made, every single chat message, all the server-side system information – even all keystrokes. Doing so will likely cause bandwidth issues, and will require substantial resources to add the message hooks into the game code, but in theory, this brute-force approach to game analytics is possible.
However, it leads to very large datasets, which in turn require huge resources to transform and analyze.
For example, tracking the weapon type, range, damage done, target, whether the target was killed or not, the weapon modifications chosen by the player, the position of the player and target, the trajectory of the bullet, etc. enables a very in-depth analysis of weapon use in an FPS. However, the key metrics needed to evaluate weapon balancing could just be range, damage done and the frequency of use of each weapon. Tracking additional features may not add any new relevant insights, and may even add noise or confusion to the analysis.
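As a minimal sketch of how little data the three key metrics need, the snippet below summarizes frequency of use, range and damage per weapon. The event layout and the field names (`weapon`, `range_m`, `damage`) are assumptions made for illustration, not a schema from any particular telemetry system.

```python
from collections import defaultdict

# Hypothetical hit events; the field names are assumptions for this sketch.
events = [
    {"weapon": "rifle", "range_m": 40.0, "damage": 30},
    {"weapon": "rifle", "range_m": 60.0, "damage": 20},
    {"weapon": "shotgun", "range_m": 5.0, "damage": 80},
    {"weapon": "rifle", "range_m": 50.0, "damage": 25},
]


def weapon_summary(events):
    """Return per-weapon share of use, mean range and mean damage."""
    totals = defaultdict(lambda: {"uses": 0, "range_m": 0.0, "damage": 0.0})
    for e in events:
        t = totals[e["weapon"]]
        t["uses"] += 1
        t["range_m"] += e["range_m"]
        t["damage"] += e["damage"]
    n = len(events)
    return {
        weapon: {
            "use_share": t["uses"] / n,
            "mean_range_m": t["range_m"] / t["uses"],
            "mean_damage": t["damage"] / t["uses"],
        }
        for weapon, t in totals.items()
    }


summary = weapon_summary(events)
```

Only three fields per event are logged here. Positions, trajectories and weapon modifications could be added later if the summary raises questions it cannot answer.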
Similarly, it may not be necessary to log behavioral telemetry from all players of a game; logging a percentage of them may be enough.
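One possible way to log only a percentage of players is to hash each player ID and keep those that fall below a threshold, so the same player is always either in or out of the sample. This is a sketch under assumptions: the function name `in_sample`, the ID format and the 10% rate are all hypothetical.

```python
import hashlib


def in_sample(player_id: str, rate: float) -> bool:
    """Deterministically decide whether a player is in the telemetry sample."""
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical player IDs; roughly 10% of them should end up in the sample.
players = [f"player-{i}" for i in range(10_000)]
sampled = [p for p in players if in_sample(p, 0.10)]
```

Because the decision depends only on the ID, a sampled player's full behavioral history stays intact across sessions, which random per-event sampling would not guarantee.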
Selecting core features
In general, if selected correctly, the first variables/features that are tracked, collected and analyzed will provide a lot of insights into user behavior. As more and more detailed aspects of user behavior are tracked, costs of storage, processing and analysis increase but the rate of added value from the information contained in the telemetry data diminishes.
What this means is that there is a cost-benefit relationship in game telemetry, which follows a simplified version of the theory of diminishing returns: increasing the amount of one source of data in an analysis process will yield a lower per-unit return.
A classic example in the economics literature is adding fertilizer to a field. In an unbalanced (under-fertilized) system, adding fertilizer will increase the crop size, but after a certain point the increase diminishes, then stops, and further fertilizer may even reduce the crop size. Adding fertilizer to an already balanced system does not increase crop size, and may reduce it.
Fundamentally, game analytics follows a similar principle. An analysis can be optimized up to a specific point given a particular set of input features/variables, before additional (new) features are necessary. Additionally, increasing the amount of data fed into an analysis process may reduce the return, or in extreme cases lead to a negative return due to the noise and confusion added by the additional data.
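To make the shrinking per-feature return concrete, here is a toy calculation. It rests entirely on an assumed model, not on real telemetry: a behavioral outcome is a weighted sum of independent, unit-variance features plus noise, and feature i is given the weight 1/i so that later features carry less signal. The function name and all numbers are hypothetical.

```python
# Toy model (an assumption, not a result from real telemetry): the outcome is
# a weighted sum of independent, unit-variance features plus noise.


def explained_variance_curve(weights, noise_var):
    """Share of outcome variance explained by the first k features, for each k."""
    total = sum(w * w for w in weights) + noise_var
    curve, running = [], 0.0
    for w in weights:
        running += w * w
        curve.append(running / total)
    return curve


# Feature i gets weight 1/i, so each added feature carries less signal.
weights = [1 / i for i in range(1, 9)]
curve = explained_variance_curve(weights, noise_var=0.5)

# Marginal gain from adding the k-th feature.
gains = [curve[0]] + [b - a for a, b in zip(curve, curve[1:])]

for k, (share, gain) in enumerate(zip(curve, gains), start=1):
    print(f"{k} features: explained {share:.3f} (gain {gain:.3f})")
```

In this model every added feature still helps a little, while the cost of tracking it does not shrink. The toy model does not capture the negative returns described above, which come from noise and confusion in a real analysis.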
No law without exceptions: for example, the cause of a problematic behavioral pattern that decreases retention in a social online game can lie in a single small design flaw, which can be hard to identify if the specific behavioral variables related to the flaw are not tracked.