Museums and the Web

An annual conference exploring the social, cultural, design, technological, economic, and organizational issues of culture, science and heritage on-line.

Levelling Up: Towards Best Practice in Evaluating Museum Games

Danny Birchall and Martha Henson, Wellcome Trust; Alexandra Burch and Daniel Evans, Science Museum, UK; Kate Haley Goldman, National Center for Interactive Learning at the Space Science Center, USA


Museums make games because games can provide compelling educational engagement with museum themes and content, and the market for games is enormous. Truly understanding whether games are achieving your goals requires evaluation. In this paper, we identify the kinds of games that museums make and use case studies of our own casual games to look at the benefits and means of evaluation. Beginning by identifying different kinds of evaluation within the broad framework of formative and summative practices, we suggest ways to plan an evaluation strategy and set objectives for your game. We then look in detail at evaluation methods: paper and wireframe testing, play-testing, soft launching, Google Analytics, surveys, and analysing responses “in the wild.” While we draw on our own experience for examples of best practice, we recognize that this is an area in which everyone has a lot to learn, and we conclude by suggesting some tactics for sharing knowledge across the museum sector.

Keywords: games, evaluation, learning, testing, best practice, research

1.   Introduction

Museums and other cultural institutions develop online games for many reasons. Firstly, web-based games effectively reach out beyond the walls of the institution and provide an experience of your content to an audience that can’t, or won’t, visit you. The online games community is huge, in the hundreds of millions (Casual Games Association, 2007), and a survey conducted by Ipsos MORI for the Science Museum in 2009 found that a quarter of young people aged 11 to 16 listed playing computer games as one of their three favorite pastimes (Ipsos MORI, 2009).

Secondly, learning is at the heart of the missions of our institutions, and online games are frequently excellent learning tools. Game mechanics can support learning (Cutting, 2011), while the virtual nature of a game can place users in different scenarios, allow them to see their actions played out, and give access to experiences that cannot be replicated in the real world because of expense or the scales involved. There is increasing recognition of the value of games. For example, a survey in 2006 (Ipsos MORI, 2006) found that 59 percent of teachers said that they would like to use computer games to support their teaching in the classroom, seeing games as providing motivating and engaging learning experiences.

Regardless of platform, museum games usually aim for “engagement” rather than being didactically educational or persuasive in campaigning for a particular cause. Evaluation is especially important in developing games that succeed in engaging the player through gameplay and subject matter, because it tests whether the specific objectives for the game have been achieved. Also, when games are designed to market the museum, it is vital to understand the extent to which that strategy succeeded.

For example, Rizk, developed by the Science Museum with Playerthree, and High Tea, developed by the Wellcome Trust with Preloaded, have both been extremely successful online games in terms of numbers and the favorable response from their audience and critics. Especially as both deal with difficult subjects (climate change science and the opium wars, respectively), failure was a real possibility. Evaluation allowed us to identify and understand the potential barriers to use, enjoyment, and learning. Evaluating is not simple, and is at times not cheap, but is easier and cheaper than fixing a finished game that doesn’t do what it should.

The case studies in this paper are “educational” and “persuasive” to lesser or greater degrees, but generally fall into the category of “casual but purposeful” games, as the bulk of museum games do. We aim to provide useful lessons from our own evaluation practice for museum games, and potentially more generalizable lessons for other forms of games.

Why evaluate games?

Evaluation may be generally agreed to be a worthy goal, but it is often not considered practical given the realistic constraints of time and money. Ultimately, all forms of evaluation are justified only to the extent that they provide you with information that enables you to make better decisions. Classic evaluation is broadly divided into two areas, according to the type of decisions each informs. Typically, formative evaluation takes place while you develop your product, with the aim of improving it; summative evaluation takes place after your product has been delivered, with the aim of gauging its impact. Techniques from both may also be used in a third, in-between area, remedial evaluation, in which the information gained is used to tweak the game post-launch.

Formative evaluation: Why hold up your tight game-development schedule by doing evaluation during production?

Project management involves juggling the famous three variables: time, cost, and quality. But how do you manage quality when you’re building a game? How do you know if a particular interface works well, or if you should spend more time improving it? How do you decide whether a nice extra feature is really worth the extra cash? How do you decide if your contractors have done their job well enough to be paid? You can’t avoid the questions; for answers, you either rely on your own judgement, knowing that you are not your target audience, or you have to measure. Without a solid evaluation strategy, you run the risk of delivering a game that is on time, to budget, and to specifications, but doesn’t fundamentally achieve its intention.

For a museum game, most of your objectives will not be about the game itself, but about users’ response to your game, so it is on this that you need to focus your attention. Sometimes you want to measure this directly: will people get my big idea? Sometimes you need to measure things that your objectives depend on: your users won’t get your big idea if they don’t enjoy the game enough to play it for a decent period; they won’t play it for very long if they can’t use it; and they won’t even try it if it’s not initially appealing. Formative evaluation is often used synonymously with usability testing, though usability is only one component of formative evaluation.

With a thorough framework for assessing quality in place, you have more control over your project. Far from the stereotypes of “design by focus group,” you can safely embark on more risky approaches: you can get closer to the boundaries if you can reliably tell when you’ve overstepped them. You can broaden the scope for creativity: if you can empirically test against your ultimate objectives, you can be more flexible in how you achieve them: “Why don’t we do it this way instead?” becomes a less-terrifying prospect. You can save time, money, and worry—in all three of the Science Museum’s major game-development projects, we have had examples of interfaces that we thought would require many iterations to get right, but in fact the audience understood straight away; we have also had examples of ideas that the audience stubbornly refused to understand, even though we thought they were quite straightforward.

If reach is a key objective—and many museum games are developed at least partly on the basis that games are popular—then you can begin to ensure that the people you are after really do want what you’re offering. You can manage situations where audience needs conflict: if you’re building a game for use in schools, for example, teachers may want it to appear reassuringly “educational,” but this may be counter-productive with their students; you need to find the optimal balance. Don’t assume that your audience is like you, and remember that your audiences change as technology, paradigms, and understandings change.

Summative evaluation: What’s the point in finding out about problems when it is too late to fix them?

The obvious answer is that you can learn for next time. It is good practice to reflect on the big questions around the project, not just within it. Was this a good idea in the first place? Does a game suit this type of content? Does it suit this audience? Can our processes be improved? Have we had any unexpected successes that we can deliberately target in future?

This last one is not to be sneered at: game-building is a creative activity, and the potential online audience in particular is large and unpredictable. As long as you stay within the scope of your organization’s mission, retrofitting success to an unexpected positive outcome can be a perfectly legitimate activity and can be useful for identifying future opportunities. In addition to building institutional knowledge, summative evaluation provides critical evidence for stakeholders, as games often meet particular scepticism within serious-minded organizations. Showing evidence that your project has met your organization’s and your sponsors’ objectives makes getting funding and go-ahead easier in the future.

So this is the key: with game evaluation, the “how” needs always to be driven by the “why”; not just in “why evaluate?” but also in “why are we doing this in the first place?”

2.   Planning your evaluation strategy

Developing an evaluation strategy allows you to understand how you’ll create the highest quality of product that addresses your and the audience’s needs within the time and budget available.

Three key things need to be agreed on in developing an evaluation strategy for an online game:

  1. The target audience.
  2. Game objectives: what you are trying to achieve and why (covered in the next section).
  3. What you will do with your findings. Typically, at the Science Museum, findings from different stages are used to inform direction, positively influence development, provide information for funders, and provide findings that can be taken forward into future projects.

Most importantly, the strategy you devise has to be pragmatic: how can you maximise positive change with the time and money available? It is useful at this stage to ask the following questions:

  1. What do we already know about the audience, about their use of games, and about the barriers they encounter with respect to the subject matter? This is what we should build on.
  2. What don’t we know? This is what we should find out.
  3. What is likely to have the biggest impact on the success of the final product? This is what we should prioritize.

It is also important to work with the designers to ensure that there is sufficient scope for change built into the programme, to communicate what you need to test and why, and to understand when they are planning on developing certain aspects such as look and feel. Aligning your and the designers’ development timetables is critical, as is tracking the implementation of findings.

Setting objectives

Clarity in objectives leads to better design. The best strategy for refining the game comes before any form of evaluation, or even design and development. In-depth discussion of what the game is designed to do is necessary to ensure that project members have a shared understanding of the goals of the game. This discussion should focus on one question: “What should people think, feel, or do differently after playing this game?”

Common objectives for museum games include:

  1. To increase brand awareness for the museum.
  2. To entice non-visitors to come to the museum.
  3. To engage players with museum themes or collections.  
  4. To encourage visitors to the museum to familiarize themselves with the institution and exhibitions, including lesser-known areas, as in a scavenger hunt.
  5. To deepen enjoyment at the museum, especially for the novice. An example of a poorly articulated but common objective would be: “The visitor will have fun.” A better objective might be: “The visitor will believe that our museum is a fun place to be,” or “The visitor will believe that we offer fun experiences beyond this game.”  
  6. To deepen observation of the collections and exhibition subject matter.
  7. To change visitor behavior in some way.
  8. To crowdsource museum needs, such as collections identification (Ridge, 2011).

Any one game can and likely will address several objectives, but it’s crucial that the team agree on which objective is primary. Without that prioritization, it’s easy to lose sight of the main goal and allow scope creep.

3.   Formative evaluation

Classic formative evaluation for gaming consists of paper/wireframe testing or play-testing. This stage of testing is about identifying potential motivational, intellectual, and usability barriers and working with the designers to find creative solutions. You see what people say and do, and importantly gain understanding of what they don’t say and do, and why. One of the major advantages of this sort of game testing is that it can provide valuable information with respect to level design.

For each prototype, you should aim to test with a minimum of eight to ten members of each target audience. Each testing iteration requires at least two weeks, during which time the following must happen:

  1. The prototype is delivered and checked to ensure that it meets specifications, and that previous changes have been incorporated;
  2. Project team staff uses the game personally;
  3. Test sessions are conducted with target audiences;
  4. Project team staff writes up responses and feeds them back to the designers.

Where possible, meeting with the designers face-to-face is generally the most productive way to give feedback. Effective testing sessions raise questions that are best resolved through creative conversations between stakeholders.

Game testing requires qualitative research methods—essentially, deep listening. You are seeking to understand the range and seriousness of the barriers encountered, the underlying reasons for those barriers, and what should be done to address them.

Paper, wireframe, and play-testing

For the purposes of this paper, both paper prototype and wireframe signify a visual framework of a screen or game element, containing the arrangement and type of content intended, and each of the functional elements of those screens. The focus is generally on what kind of content is available, the order that content is displayed in, and what the options are from that screen. Wireframes generally are not the final presentation of design elements such as images.

As wireframing tools have become more flexible and wireframes quicker to produce, wireframe testing has become more common than testing with actual paper prototypes (some tools, such as Balsamiq Mockups, even mimic the look and feel of paper prototypes).

For game testing, we typically ask for three prototypes with the aim of identifying the biggest problems at the earliest stages. For example, with Rizk the first round of testing focused on user understanding of the underlying concept and revealed that users thought that it was a game about ecology rather than risk management. These findings had a huge influence on all aspects of the game—including the look and feel, the language used, and the need for some game mechanics to be more explicit—since we knew that if we didn’t get this right, then no matter how enjoyable the game was, it wouldn’t have met our content objectives. Later testing focused on usability, and the final round provided information on in-game help, the introduction, and level design.

Earlier prototypes can be quite rough and ready: they need to convey the underlying concept, content, and type of game play, but can be as simple as a set of linked storyboards or even a paper-based game. As the development proceeds, more elements can be introduced into the prototypes so that they become increasingly like the finished product.

This type of testing, using a combination of observation and questionnaires, allows you to explore both what users do or fail to do and their emotional responses. It also gives insight into what users thought they were doing, why they failed to do other things, how they responded to the game, and what they thought it was trying to convey. Game testing can be conducted with individuals or with small groups of two or three users; groups have the advantage that conversation between users reveals useful information. Whether testing with an individual or a group, it is worth encouraging users to think aloud, as this reveals more about what they are doing and why.

It is also best to use a combination of free-choice and directed interaction. During Rizk testing, users played the game first without any direction from the evaluator in order for the evaluator to gain some sense of how this would actually be played. Users were then taken back to particular elements and stages in order for the evaluator to probe more deeply about their understanding, responses, and behaviors.

The key challenge in formative testing is timing it within the development cycle, as its most powerful advantage is the ability to provide useful information on the user interface, functionality, and flow of the game. Since formative evaluation normally takes a minimum of two weeks—longer if multiple sessions are intended—during the rapid process of game development, it becomes tempting to cut into the evaluation time for additional development time.  

Soft launch/Remedial evaluation

Evaluation feeding into your final product doesn’t have to stop at the point at which your product is “finished”: making changes to an online game while it’s being used by the public is more practical than it is with physical products such as exhibitions, and the unpredictability of online audiences can make post-launch changes in response to user feedback a useful way of optimising your game. This can take the form of a full public beta with a formal user-feedback process, which is quite common in the commercial gaming sector but rare with museums, which are often reluctant to publicly display “unfinished” work. A private beta is possible, of course, though this loses the advantage of being able to test with the full potential game audience.

On a smaller scale, it can just be a case of reserving some scope for development for the post-launch period and monitoring initial user feedback and usage. For example, post-launch feedback for High Tea led to the creation of a simple tooltips-based tutorial for the game’s early stages, as well as the correction of a minor bug.

Post-launch evaluation can be a particularly effective technique for perfecting level design, which can often take place without major reengineering of code and can be critical for the success of a game in drawing users into increasingly challenging learning activities. Monitoring the progression of users through levels and making changes to noticeable drop-off points can make a substantial difference to the success of the game at relatively little expense.

Be aware that when a game is available to the public, you may have little control over how well-publicised it is. The Science Museum’s Launchball game was initially soft-launched for final testing, but was on the front page of Digg within 12 hours.

4.   Summative evaluation

So you’ve released your game, carefully developed using the formative techniques described above. You’ve given it the best possible chance of being a great game that engages players with the subject matter, but you still want to be sure this has worked with the potentially huge and diverse international audience it could find online. You also want to be sure that you pick up unexpected reactions you didn’t think to test for beforehand. In some cases, the time or resources available for testing during development may have been limited.

Post-launch, summative evaluation is key to answering these questions and fully understanding the reaction your game has provoked. It also enables you to test (and even further develop) your distribution strategies and marketing, and it provides vital information for developing future games. Several of the formative techniques described above can also be used at this stage, of course, but here we describe other methods available for evaluating this larger and more dispersed audience.

Google Analytics

Google Analytics (GA) is becoming the de facto standard tool for measuring online audiences (Finnis, Chan, and Clements, 2011). Despite privacy issues and possible European legislative hurdles surrounding the use of cookies, GA offers an unrivalled set of specific metrics for online content. It also offers full integration with Flash, both with and without ActionScript (Google Code, 2012), making adding in-depth analytics to a museum game relatively easy.

GA can be used first and foremost to accurately track the size of a game’s audience, a criterion for success in most cases. High Tea peaked when featured on Kongregate’s homepage, but within three months had fallen to a static “long tail” of a few thousand plays per day (Birchall and Henson, 2011); this is typical of a Flash game distributed through portals.

GA also allowed us to look at the demographic composition of the audience. For High Tea, we found that after the UK and the US, the game’s biggest audience was in Brazil. It’s not surprising that a techno-literate emerging economy should have a sizable audience for casual games, but we certainly hadn’t thought of the implications of the game’s content for a Brazilian audience. We were very interested in whether the game had been played widely in China, but anecdotal evidence (Xu, 2011) suggests that Google’s statistics for audiences within China are not reliable.

Beyond tracking users, however, GA can also be used to track events within the game. For High Tea, events were attached to the beginning and end of each level of the game. This allowed us not only to track individual “plays” of the game rather than simple loads of the page or Flash object, but also to track players’ progress through the game, seeing how many completed each level (in this case a year of the game’s ten-year narrative) and proceeded to the next. In effect, this allows you to plot something like a difficulty curve in summative evaluation from all players’ data, something that’s usually only possible during testing of a game. Unless you can modify the game post-launch, this may be of limited utility; but it may also be helpful in verifying initial assumptions about the game’s playability. GA functions can also be attached to learning events, to see how many players opted to find out more after playing the game, and to promotional link-throughs to institutional sites and social media actions.
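As a rough illustration of the difficulty-curve idea, the sketch below turns per-level start and completion event counts into a completion rate for each level. It assumes the counts have already been exported from the analytics tool into plain dictionaries; the function names and the numbers are hypothetical and are not High Tea data.

```python
from typing import Dict, List, Tuple


def level_funnel(starts: Dict[int, int],
                 completions: Dict[int, int]) -> List[Tuple[int, float, float]]:
    """Return (level, completion_rate, reach) for each level.

    completion_rate: completions of a level divided by starts of that level.
    reach: starts of a level divided by starts of the first level.
    """
    rows = []
    baseline = starts[min(starts)]
    for level in sorted(starts):
        started = starts[level]
        finished = completions.get(level, 0)
        completion_rate = finished / started if started else 0.0
        reach = started / baseline if baseline else 0.0
        rows.append((level, completion_rate, reach))
    return rows


def hardest_level(rows: List[Tuple[int, float, float]]) -> int:
    """Return the level with the lowest completion rate."""
    return min(rows, key=lambda row: row[1])[0]


# Hypothetical event counts, invented for illustration only.
level_starts = {1: 1000, 2: 700, 3: 600, 4: 300}
level_completions = {1: 750, 2: 630, 3: 330, 4: 240}

funnel = level_funnel(level_starts, level_completions)
for level, rate, reach in funnel:
    print(f"Level {level}: {rate:.0%} completed, {reach:.0%} of first-level starts")
print("Lowest completion rate at level", hardest_level(funnel))
```

A level whose completion rate is markedly lower than its neighbours' is a candidate drop-off point worth examining, if the game can still be modified.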

With High Tea, we adopted our agency Preloaded’s strategy of “working with pirates”: making it deliberately easy for anyone to download the SWF file and republish it on their own website (Stuart, 2010). By using tags within the Flash that tracked the host, GA allowed us to see just how successful this strategy was. Looking at hosts revealed that just under 3 percent of plays happened on our own specially constructed website for the game, while 45 percent of plays happened on the major portals to which we seeded the game. But the overall majority of plays (52 percent) happened on websites that we had no formal relationship with (and in most cases were completely unaware of).
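The host breakdown described above amounts to grouping plays by the site that served the game. A minimal sketch follows, assuming play counts per host have been exported as a dictionary; the host names, category labels, and counts are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical host lists; a real project would substitute its own.
OWN_SITES = {"game.example.org"}
SEEDED_PORTALS = {"portal-a.example", "portal-b.example"}


def classify_host(host: str) -> str:
    """Assign a host to 'own site', 'seeded portal', or 'other'."""
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in OWN_SITES:
        return "own site"
    if host in SEEDED_PORTALS:
        return "seeded portal"
    return "other"


def share_by_category(plays_by_host: Dict[str, int]) -> Dict[str, float]:
    """Return the percentage of all plays that fall in each host category."""
    totals = defaultdict(int)
    for host, plays in plays_by_host.items():
        totals[classify_host(host)] += plays
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: 100 * plays / grand_total
            for category, plays in totals.items()}


# Hypothetical play counts, invented for illustration only.
plays = {
    "game.example.org": 100,
    "www.portal-a.example": 200,
    "portal-b.example": 100,
    "arcade-one.example": 350,
    "arcade-two.example": 250,
}
shares = share_by_category(plays)
for category, percent in shares.items():
    print(f"{category}: {percent:.1f}% of plays")
```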

GA allowed us not only to quantify the effect of encouraging piracy, but also to demonstrate to those for whom intellectual property is an issue the fact that our strategy had doubled the audience for the game. It also had the added bonus of suggesting to us where we might look for comments and feedback on the game “in the wild” (see Analysing the “In the Wild” response section below).

Surveys and interviews

Quantitative data from analytics can provide some powerful insight, as described above. However, it cannot answer questions about the quality of player engagement: what they felt about the game or learnt from playing it. In some cases the analytics actually raise questions about player behavior that can only be answered by asking the players themselves.

There are several different ways to gather this sort of qualitative information. The methodology we’ve used at Wellcome Collection has been to start with a survey for players and then follow this up by interviewing a sample of survey respondents, both on the telephone and in person. We are fortunate to have an in-house research team in our Policy department that has assisted us greatly with both development and implementation of this qualitative research, but the technology for creating surveys is widely available and accessible, and often free. Along the way, we’ve learnt several key things about creating a good survey:

  1. Keep the questions relevant to your objectives and learning outcomes.
  2. Your audience will give up if the survey is hard to understand. Keep questions clear and concise, and keep answer formats consistent (all number scales or all ranges of “agreement”).
  3. Your audience will give up if the survey is too long. Depending on the software, you may be able to track incomplete surveys. You can’t answer everything with a survey. In particular, if a question is likely to require a complex nuanced response, this may be better for follow-up interviews.
  4. Put a statement up front that details how much time you think the survey will take and how the information will be used, to allay privacy concerns. Use a progress bar or other appropriate method to let users see how much further they need to go.
  5. Ask questions about demographics at the end. Respondents will be more comfortable answering these, and less bored by them, once they have already been asked for their opinion.
  6. Create survey questions that could be repeatable for future games, where appropriate. This enables you to make valid comparisons between them.
  7. Questions may be useful even if they duplicate information from the analytics (e.g., player’s country or site on which they played). This may feel redundant but in fact allows for cross-tabbing data (e.g., did people who played your game on Armor feel more positively about it than people who played on Kongregate).
  8. Leave one open field question for general comments; this can be really interesting. It’s possible to code up the answers to do further analysis on this as well (e.g., to analyse the percentage of respondents who discussed the theme or who made a positive or negative comment, etc.). It also gives the respondents a place to discuss what may be on their mind but not covered within the survey.
  9. Offer an incentive, even just a small one such as a prize-draw entry for vouchers, to encourage responses and reward time spent. We find Amazon certificates to be popular, as they can be delivered by email, require little personally identifying information, and can be used for a wide variety of items.
  10. Ask for contact details so you can follow up with telephone interviews. Get email and phone contact, including asking if individuals are willing to be contacted at a daytime number. Surprisingly, some respondents are more willing to be interviewed while at work.
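Points 7 and 8 in the list above (cross-tabbing by portal and coding open comments) can be sketched in a few lines. The example assumes survey responses have been exported as a list of dictionaries; the field names, ratings, and comments are hypothetical, and the keyword match is only a crude stand-in for proper manual coding of open answers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def mean_rating_by(responses: List[dict], group_key: str,
                   rating_key: str) -> Dict[str, float]:
    """Cross-tabulate: mean rating for each value of group_key."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of ratings, count]
    for response in responses:
        cell = totals[response[group_key]]
        cell[0] += response[rating_key]
        cell[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def share_mentioning(responses: List[dict], keywords: Iterable[str],
                     field: str = "comment") -> float:
    """Percentage of respondents whose open comment contains any keyword.

    Keywords are expected in lower case; comments are lower-cased before matching.
    """
    if not responses:
        return 0.0
    keywords = list(keywords)
    hits = sum(
        1 for response in responses
        if any(keyword in response.get(field, "").lower() for keyword in keywords)
    )
    return 100 * hits / len(responses)


# Hypothetical survey rows, invented for illustration only.
survey = [
    {"portal": "Kongregate", "enjoyment": 4, "comment": "Loved the history angle"},
    {"portal": "Kongregate", "enjoyment": 5, "comment": "Too hard at the end"},
    {"portal": "Armor", "enjoyment": 3, "comment": ""},
    {"portal": "Armor", "enjoyment": 2, "comment": "Interesting history lesson"},
]

print(mean_rating_by(survey, "portal", "enjoyment"))
print(share_mentioning(survey, ["history"]))
```

With only a handful of respondents per portal, differences like these are suggestive at best; the cross-tab is most useful for deciding what to probe in follow-up interviews.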

In the case of High Tea, we followed up the survey with in-depth telephone interviews with seven players, and a small focus group with three people (Birchall and Henson, 2011). Using telephone interviews meant we could interview international players, in this case in Brazil, Canada, and the USA. We created a template protocol so that everyone was asked the same questions, designed to result in an interview approximately 15 minutes long. We felt this was the maximum time we could ask of people. We also offered an incentive of Amazon vouchers, which we feel is very important both to encourage people to take part and to ensure they feel rewarded for their input rather than resentful of the time taken.

Telephone interviews provide the strongest and most useful information when they can build upon the survey responses, exploring user feedback in depth and unpacking any unclear survey responses. In High Tea, for example, the interviewers were able to address the curious question of why some survey respondents had felt more positively about the actions of the British Empire after playing the game (they simply felt they understood it better). Had we run the interviews at the same time as, or independently of, the survey, we would have missed this opportunity.

Good resources are available supporting online survey design (Laboratory for Automation Psychology and Decision Processes, 2011), question design (Bradburn et al., 2004), and increasing response rate (Perkins, 2011).

Analysing the “In the Wild” response

Some of the most interesting feedback about a game can come from spaces over which you have no control. The very fact that it is uncontrolled can reveal the unintended or surprising ways in which people view your game.

Game portals such as Kongregate, Armor, or Newgrounds (as well as the smaller games sites that survive on ripped games) are also social, allowing players to comment, review, and rate games. These comments normally focus on gameplay and winning strategies, but with High Tea, we were pleasantly surprised to see the high number of thoughtful responses from players discussing the historical and economic aspects of the game. This alone was enough to show us that our aim of creating a game that engaged people with the subject matter had worked, and this feedback was a huge advantage of distributing to these sites. Depending on the game, it may also be possible to do more formal evaluation of user commentary. For WolfQuest, a downloadable role-playing game, we selected discussion threads at random using a random number generator and carried out dialogue analysis on them to document player understanding of scientific concepts.
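One way to draw a random sample of discussion threads for this kind of analysis is shown below. This is a sketch only, not the procedure used for WolfQuest: the thread identifiers, sample size, and seed are hypothetical. Fixing the seed means the same sample can be drawn again later, which helps when documenting the method.

```python
import random
from typing import List, Sequence


def sample_threads(thread_ids: Sequence[str], sample_size: int,
                   seed: int) -> List[str]:
    """Draw a reproducible random sample of thread IDs without replacement."""
    rng = random.Random(seed)  # separate generator, so the seed is explicit
    k = min(sample_size, len(thread_ids))
    return sorted(rng.sample(list(thread_ids), k))


# Hypothetical thread identifiers, invented for illustration only.
all_threads = [f"thread-{number:03d}" for number in range(1, 201)]

chosen = sample_threads(all_threads, sample_size=20, seed=2012)
print(len(chosen), "threads selected for dialogue analysis")
```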

Commentary is not just to be found on the games portals, though. You may find reviews appearing on blogs and games sites, and even YouTube, on which people will also comment. Google Alerts is very helpful for finding these, and it is worth setting up a few different combinations of search terms. You can also use the traffic sources data from the analytics, of course, or run your own searches.

We also found discussions taking place on Reddit, Metafilter, and other social news and link-sharing sites, along with forums for various topics and specialisms. Be aware that some of these discussions may happen in other languages, so you may need to be creative in your searching.

5.   Conclusions

In this paper, we have set out the case for the importance of evaluation for museum games, particularly those that seek to achieve engagement with museum content. Our understanding of good evaluation is that it is predicated on clearly set and prioritized objectives, measured with a well-defined set of evaluation tools. While formative evaluation is key to ensuring that you produce the game that you want to produce, summative evaluation can tell you much about the success of a game after its launch, including the unexpected and the intriguing.

We have presented a selection of both formative and summative methods. Few projects will have the luxury of incorporating all these into an overall evaluation plan, but we stress the benefits of multi-method evaluation, and in particular combining methods to refine or amplify your understanding, such as follow-on phone interviews from surveys.

As its title suggests, we aim for this paper to be a step towards identifying best practice rather than supplying definitive answers. We hope that our session at Museums and the Web 2012 will provoke discussion and the sharing of others’ experiences in game evaluation, which will in turn inform our own future practice. A useful space for discussion of museum games and their evaluations has been created, and we would encourage you to add your own thoughts and examples there.


6.   References

Birchall, D., and M. Henson. (2011). High Tea Evaluation Report. Consulted January 28, 2012. Available at:

Bradburn, N.M., S. Sudman, and B. Wansink. (2004). Asking Questions: The Definitive Guide to Questionnaire Design—For Market Research, Political Polls, and Social and Health Questionnaires. Revised edition. Jossey-Bass.

Casual Games Association. (2007). Casual Games Market Report 2007. Consulted January 28, 2012. Available at:

Cutting, J. (2011). “Telling Stories with Games.” In Museums At Play: Games, Interaction and Learning. Museums Etc.

Finnis, J., S. Chan, and R. Clements. (2011). Let’s Get Real: How to Evaluate Online Success? Report from the Culture24 Action Research Project. Consulted January 28, 2012. Available at:

Google Code. (2012). “Google Analytics—Adobe Flash Setup.” Consulted January 28, 2012. Available at:

Ipsos MORI. (2006). “Computer Games for Learning.” Consulted January 28, 2012. Available at:

Ipsos MORI. (2009). “Young people Omnibus 2009 (Wave 15): A research study amongst 11-16 year olds on behalf of the Science Museum.”

Laboratory for Automation Psychology and Decision Processes. (2011). Online Survey Design Guide. Consulted January 31, 2012. Available at:

Perkins, R.A. (2011). “Using Research-Based Practices to Increase Response Rates of Web-Based Surveys.” EDUCAUSE Quarterly, 34(2). Consulted January 31, 2012. Available at:

Ridge, M. (2011, March 31). “Playing with Difficult Objects – Game Designs to Improve Museum Collections.” In J. Trant and D. Bearman (eds.), Museums and the Web 2011: Proceedings. Toronto: Archives & Museum Informatics. Consulted January 30, 2012. Available at:

Stuart, P. (2010). “How we publish an online game.” Preloaded blog. Consulted January 28, 2012. Available at:

Xu, M. (2011). “Dilemma of Google Analytics in China.” Consulted January 28, 2012. Available at: