A Lightweight Approach of Human-Like Playtest for Android Apps

Contributors: Yan Zhao, Weihao Zhang, Enyi Tang, Haipeng Cai, Xi Guo, Na Meng

In the video game industry, playtest refers to the process of exposing a game to its intended audience in order to reveal potential software flaws during game prototyping, development, soft launch, or after release. Game vendors sometimes recruit human testers from playtest platforms and pay them to play games. Meanwhile, the mobile gaming industry has been growing incredibly fast. According to Sensor Tower, worldwide spending on games grew 12.8% across the App Store and Google Play in 2019. By the end of 2019, 45% of global gaming revenue came directly from mobile games; among all mobile apps, mobile games accounted for 33% of all app downloads, 74% of consumer spending, and 10% of all time spent in-app. The booming mobile game industry has led to rapidly growing demand for game testing, but hiring human testers to play games manually is expensive and time-consuming.

Researchers and developers have proposed approaches to automatically test Android apps and video games, but tool support is still insufficient. For instance, random testing (e.g., Android Monkey) and model-based testing (e.g., AndroidRipper) execute apps under test by generating various input events (e.g., button clicks) to trigger diverse program executions. However, these approaches only recognize the standard UI controls defined by Android (e.g., Button and CheckBox). They cannot identify customized playable UI items (e.g., the target and arrow shown in Fig. 1), nor do they use any domain knowledge to play games effectively. Some approaches adopt machine learning (ML) techniques to test games by training a model on large amounts of data. However, these approaches are heavyweight; they usually require tremendous computational resources, careful hyperparameter tuning, and considerable ML expertise from users.

Fig. 1: A snapshot of the game Archery

To help general developers efficiently test Android games without using ML techniques, in this paper we present the design and implementation of a lightweight game-testing approach named Lit. As shown in Fig. 2, Lit has two phases: tactic generalization and tactic concretization. Here, a tactic describes in what context (i.e., program states) what playtest action(s) can be taken and how to take those actions.

Fig. 2: Lit consists of two phases: tactic generalization and tactic concretization
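To make the notion of a tactic more concrete, the sketch below shows one possible in-memory representation; the class and field names (AbstractContext, Tactic, param_settings, synthesized_fns) are illustrative assumptions rather than Lit's actual data structures.

```python
# Hypothetical representation of a tactic; names are illustrative, not Lit's implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class AbstractContext:
    """A scene stripped of scenery background: only the recognized game icons."""
    icons: List[str]                    # icon labels, e.g., ["target", "arrow"]
    positions: List[Tuple[int, int]]    # screen coordinates of each icon

@dataclass
class Tactic:
    """In a given abstract context, which action type to take and how to take it."""
    context: AbstractContext
    action_type: str                    # e.g., "swipe" or "tap"
    # Alternative parameter settings and/or synthesized functions that map a
    # concrete context to concrete action parameters (e.g., a swipe angle).
    param_settings: List[dict] = field(default_factory=list)
    synthesized_fns: List[Callable] = field(default_factory=list)
```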

Phase I: Tactic Generalization

Phase I requires users to (1) provide snapshots of game icons and (2) play the game G for a while. Based on the provided snapshots, Lit uses image recognition to identify relevant icons in a given scene. When users play G, Lit recognizes each user action with respect to the game icon(s) and records a sequence of ⟨context, action⟩ pairs. Here, a context discards the scenery background but keeps all recognized game icons. From the recorded pairs, Lit generalizes tactics by (1) identifying abstract contexts AC = {ac1, ac2, . . .} and major action types AT = {at1, at2, . . .}, and (2) deriving alternative parameter settings and/or functions that map each abstract context to an action type.
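As a rough illustration of the recording step, the sketch below assumes OpenCV template matching as the image-recognition component; the function names (find_icons, record_demo) and the similarity threshold are hypothetical and do not reflect Lit's exact implementation.

```python
# A minimal sketch of icon recognition and demo recording, assuming OpenCV
# template matching; all names and the threshold value are illustrative.
import cv2

THRESHOLD = 0.8  # assumed similarity threshold for declaring a match

def find_icons(scene_path, icon_templates):
    """Return the recognized icons (label, location) in one game scene."""
    scene = cv2.imread(scene_path, cv2.IMREAD_GRAYSCALE)
    matches = []
    for label, template_path in icon_templates.items():
        template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
        result = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        if max_val >= THRESHOLD:
            matches.append((label, max_loc))
    return matches

def record_demo(scenes_and_actions, icon_templates):
    """Record a sequence of <context, action> pairs from a user demo."""
    pairs = []
    for scene_path, user_action in scenes_and_actions:
        context = find_icons(scene_path, icon_templates)  # background is dropped
        pairs.append((context, user_action))
    return pairs
```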

Phase II: Tactic Concretization

Phase II takes the generalized tactics as input and plays G accordingly. Given a scene s, Lit extracts its context c and tentatively matches c with each abstract context ac ∈ AC involved in the tactics. If there is a match, Lit randomly picks a corresponding parameter setting and/or synthesized function to create an action for game testing.
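The following sketch illustrates how such matching and action creation could look in code, reusing the hypothetical Tactic representation from the earlier sketch; the matching rule here (comparing icon label sets) is an assumption, not Lit's actual algorithm.

```python
# A hedged sketch of tactic concretization; helper names are illustrative.
import random

def matches(context, abstract_context):
    """Assumed matching rule: the scene contains the same set of icon labels."""
    labels = {label for label, _ in context}
    return labels == set(abstract_context.icons)

def concretize(current_context, tactics):
    """Pick a tactic whose abstract context matches and create a concrete action."""
    for tactic in tactics:
        if matches(current_context, tactic.context):
            if tactic.synthesized_fns:
                fn = random.choice(tactic.synthesized_fns)
                return tactic.action_type, fn(current_context)
            if tactic.param_settings:
                return tactic.action_type, random.choice(tactic.param_settings)
    return None  # no tactic applies; a fallback (e.g., a random event) could go here
```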

For evaluation, we applied Lit, two state-of-the-art testing tools (i.e., Monkey and Sapienz), and a reinforcement learning (RL)-based tool to a set of game apps. Our evaluation shows that, with an eight-minute user demo for each open-source game, Lit outperformed all the other tools by achieving higher test coverage and triggering more runtime errors. Specifically, for CasseBonbons (a game similar to Candy Crush Saga), Lit achieved 79% branch coverage, whereas Monkey, Sapienz, and the RL-based tool achieved 1%, 33%, and 65% branch coverage, respectively. Lit triggered two runtime errors in the tested games, while the other tools triggered none. Our experiments show that Lit can test games of three popular categories: match3, shooting, and basic board games. As hundreds of games belong to these categories, we believe that Lit can greatly help game developers test their games efficiently and improve software quality.