Providing Accessible Online Collections
Rachael Rainbow, Alex Morrison, Cogapp, UK; Matt Morgan, The Metropolitan Museum of Art, USA
Feedback from user research for the recently re-launched Metropolitan Museum of Art website made it clear that users need multiple routes into the collections online. Visitors want to search across data, on multiple dimensions, and find other artworks that may interest them. Enabling this with the data available was a huge challenge. Museum collections data is not necessarily structured to support such multi-dimensional searching, especially in a collection as diverse as the Met’s. In this paper we discuss the system we built to enable the Museum to analyse the data and construct rules to support these user journeys. With this system, Museum staff match terms against standard vocabularies (including AAT and TGN), provide contextual interpretation of terms by department, qualify terms such as locations, and define rules for common misspellings. A powerful Solr search system provides the quick and complex search functionality, establishes the relationships between artworks, provides free-text searching, and gives facet counts.
Keywords: Collections, search, Solr, vocabularies, metadata
1. The Metropolitan Museum of Art’s Collections
The Metropolitan Museum of Art in New York is one of the world's largest and finest art museums. Founded in 1870, its encyclopedic collections include hundreds of thousands of works of art, spanning more than five thousand years of world culture, from prehistory to the present and from every part of the globe.
In 2009, The Metropolitan Museum of Art commissioned Cogapp to redesign their website. The Museum had already made some of their collection available to the public online but user research carried out as part of the website re-launch project highlighted a number of issues with the existing presentation and functionality:
- Poor search results being returned by the search engine (for example a search for ‘Rembrandt’ returning a selection of African sculptures).
- The first results page of the collection database was discouraging to some users as many of the results were written in languages other than English.
- It was clear that many users wanted to use the collection database to discover new artworks. These users would use search terms such as artists’ names, art movements, dates, geographical regions or cultures, yet they frequently struggled to find relevant results using these terms.
Figure 1. Existing online collection
2. User research findings
Cogapp approached the website re-launch from a user-centred perspective. We carried out initial one-to-one interviews with 46 users and potential users from a wide range of audiences (including general museum visitors, frequent museum visitors, members, researchers/academics, families and educators) and followed this up with card sorting exercises with a further 14 users.
From the user research we established a number of user goals for the artworks and collections including:
- Find a specific artwork.
- Find artworks by time period.
- Research specific civilizations.
- Search by a place or country.
- Find artworks by genre.
- Search for paintings by 'school.'
- Search for types of artifacts.
- Find out about a specific artist.
In order to understand more about how users thought about the artworks in the collections, we carried out an open card sort (a card sort whereby users can create and name their own categories). Users were presented with a deck of 63 artwork cards which were drawn from the first three objects from the ‘Collection highlights’ of each department on the existing website. The cards were printed with an image of each object and tombstone data. The order of the data fields on the cards was randomized to prevent participants easily homing in on a single facet to sort by.
Figure 2. Artwork card for card sorting
When sorting the art cards, participants with a pre-existing knowledge of art created a larger number of categories and displayed a greater anxiety to ‘get it right’.
At one end of the scale Makala (aged 10) diligently sorted the piles into ‘keepers and non-keepers’, which she then labeled ‘Pieces of artwork that would interest children’ and ‘Pieces of artwork that would bore them’.
With general visitors, personal experience became the determining factor. The response was strongly visual and was easier when the object was clearly apparent. For example, there was a major split between 2D and 3D objects (‘that’s a pot, that’s a painting’), between materials (drawings and paintings), between obvious geography (an American pile was a common theme) and age (‘it looks old’).
General visitors sorted quickly and had few dilemmas categorizing the cards outside of understanding what some of the objects were. For frequent visitors, a little knowledge led to their creating a few more groups and expressing that the card sort was complex and would be easier if they could label by multiple facets.
A few favourite artists were given categories of their own:
"Da Vinci belongs to a class by himself" – Gladys (frequent visitor), who pulled out familiar artists by name.
Frequent visitors still tended to name groups after what they saw on the card images (e.g. photography, instruments), but also started to use the curatorial departments they were familiar with (e.g. American decorative arts). The other common sorting patterns were geography and artistic period.
The researcher/academic participants spent more time reading the data on the cards than referring and responding to the images.
Throughout the card-sorting sessions, it was clear that users wanted to explore the artworks and learn more about them. Some of the most common goals users reported relate to learning about the context of an artwork. Why was it made? Who was the artist? What is it made of? Why has the Museum selected it? What other works does it relate to? It is through building these connections that people are able to further engage with the work.
"We love the marvelous conjunctions, the curiosity chest is more exciting. I’m reaffirming this boring Janson breakdown but I’m open to new ways to explore and see" – Virginia (researcher / academic)
Some of the most positive experiences we observed were where those with a little less knowledge about art were able to learn something about an object they liked. A memorable example was an infrequent museum visitor who, having spent some time exploring a dragon illustration, explained that it could inspire a tattoo. This experience was especially powerful when that knowledge could then be used to build connections between objects.
The importance of finding an entry point into a collection from an artwork that resonates with the user is something we have observed on multiple occasions. For example, in user testing with the British Museum in 2006, we saw a carpenter who was unenthused by the stone sculptures he saw but immediately became more engaged when he encountered objects made of wood, a material he understood. Similarly, when testing a collection of ‘Icons of England’ for the UK’s Department for Culture, Media and Sport, a young car mechanic who had not understood the importance or relevance of Constable’s ‘The Hay Wain’ suddenly understood the concept of an icon when he saw the Mini Cooper was included in the list and surmised that, as the Mini is an important car, ‘The Hay Wain’ must be an important painting.
Another request we have seen frequently in user testing for museum websites is for large object images. This desire was strongly reinforced in the user testing for the Met. For many people this is the key component of the artwork record.
3. Accessing the collection
The user research we conducted made it clear that users need multiple routes into the collections including straight searching for experienced users, and more guided routes for less knowledgeable visitors.
During the subsequent development of the information architecture and wireframes for the new website, we developed a number of routes into the collections including a concept of search facets. These are aspects of a collection object or artwork by which site visitors can refine their search, choose an area of interest to browse or discover relationships between artworks. Exposing the metadata within the facets enables a user to quickly and easily find his or her particular areas of interest:
Who – the artist, maker or the culture of the work
What – the technique or material used for the artwork
Where – geographic location
When – era or date the artwork was made
In the Museum – the department in the Museum
In order to provide an entry point for less experienced users and for users who want to be introduced to a broad selection from the Museum’s collections, the website provides two ways of suggesting artworks: Browse Highlights and Artwork of the Day.
Browse Highlights comprises 1,479 highlighted works that are displayed prominently in a slideshow on the collections landing page and on a dedicated ‘Browse Highlights’ page. Highly visual and designed to provide a selection of easy entry points, the highlights in the slideshow are randomized and change daily to ensure that repeat visitors are presented with a varied selection.
Figure 3. The Browse Highlights slideshow
Editorially selected, Artwork of the Day was a feature from the previous website that proved popular with users. It is available as a tab from the Collections landing page and as an RSS feed and provides a historically-based introduction to selected works from within the collections.
To meet the needs of more experienced users who are searching for something specific, we provide a collections search consisting of a free text search box. As users type in their search terms, an “auto suggest” feature automatically suggests the terms they may be looking for by completing the search term with words drawn from the artwork title, artist name, where facet, what facet and accession number. This feature both assists users and reduces the risk of failed searches due to misspellings. Artworks from the collection are also available from the site wide search.
Figure 4. The collections search showing the auto suggest
For users who want to search by a specific aspect such as place or time period, we have included the ability to search within an individual facet. As well as a free text search (with auto suggest drawn only from that facet), the Museum provides a number of editorially-controlled suggested searches. The suggested searches provide both a way into artworks the Museum has decided to promote and an overview of the type of terms likely to produce results in this search.
Figure 5. The facets search
Once a search has been conducted, the results can be further narrowed down by searching for index terms within any of the facets, enabling the user to build up complex queries such as “Lacquer + Japan + AD 1800-1900.”
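A drill-down of this kind can be expressed as one filter per chosen index term. The sketch below assumes a Solr-style HTTP query interface (Solr is the search system described in section 6) and hypothetical field names `what`, `where`, `when`, `who` and `department`. It only builds a query string for illustration and is not the Museum's actual query.

```python
from urllib.parse import urlencode


def build_facet_query(text, filters, facet_fields, rows=20):
    """Build a query string with one filter query (fq) per chosen index term.

    `filters` is a list of (field, term) pairs; the field names are
    hypothetical and would depend on the real index schema.
    """
    params = [("q", text or "*:*"), ("rows", str(rows)), ("facet", "true")]
    for field, term in filters:
        # Quote the term so that multi-word values are treated as a phrase.
        params.append(("fq", f'{field}:"{term}"'))
    for field in facet_fields:
        # Ask for counts per index term in each facet.
        params.append(("facet.field", field))
    return urlencode(params)


query = build_facet_query(
    "",
    [("what", "Lacquer"), ("where", "Japan"), ("when", "AD 1800-1900")],
    ["who", "what", "where", "when", "department"],
)
print(query)
```

Each further refinement adds another filter pair, so the same function covers both the first search and every subsequent narrowing step.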
The user research showed that in order to draw users into exploring the collections and enable them to make onward journeys from their chosen entry point, connections between objects are particularly important to let users drill down, expand out and relate across the collection as appropriate.
In order to do this, on the individual artwork pages we include a ‘see also’ section which lists all the index terms applicable to the artwork divided into the five facets. This enables users to explore the collection further based on whatever aspect of the work appeals to them. We also suggest related artworks they might be interested in drawn from works by the same artist or, where that is not possible, works that share the same index terms.
Figure 6. The artwork page
4. Museum collections and data issues
Providing the variety of search and browse options needed to satisfy the requirements of the various user groups can only be achieved if the underlying data supports it. When the collections are as large and wide-ranging as the Met’s, this provides a number of challenges. In designing a search system with multiple entry routes and connections between artworks we had to be aware of – and find an approach to – a number of data issues common to many museum collections.
- The data recorded in the collections management system is used for a variety of purposes and is not collected with the primary purpose of display on the public website. It is the museum's primary source of scholarly object information and must not be changed to meet the requirements of the website.
- Different artworks have different data associated with them. For example, for departments such as Egyptian Art or Greek and Roman Art there are many artworks where no artist is credited with the work, so a search by artist name only would return very few results for these collections.
- Spellings, particularly the names of people and places, vary.
- Whilst the website is likely to be mainly in one language (in this case English), preferred terms may be in another language.
- Data may come from multiple sources, such as different collection management systems per department.
- Different departments may use fields within the collections management system in different ways.
- Even with an agreed taxonomy, the granularity of the terms used may vary between departments (for example, what is a ‘painting’ in one department might be described as a ‘watercolor’ in another).
5. Data processor and standard vocabularies
In order to generate index terms for the five facets, we use the following fields:
- Who – artist’s name (+ culture for some departments)
- Where – geography (+ culture if not included in ‘who’) + artist’s nationality (where available)
- What – medium + classification + object name
- When – date (beginning and end dates)
- In the Museum – Museum department
In order to provide the index terms within the facets, we added another layer of data processing between the collections database and the online collection. This enables us to perform some additional processing before the data is presented on the website, without changing the underlying collections data in any way.
The data processing stage has two elements:
- Mapping the collections data to a number of thesauri.
- Mapping rules set by the Museum staff about how the data should be treated.
This permits us to:
- Set preferred terms.
- Deal with spelling variants and misspellings.
- Exclude terms that are not meaningful to end users (e.g. ‘non-organic matter’) or not meaningful in a specific context. For example, what is relevant in one branch of a taxonomy may not be the most relevant to the institution’s collection. (‘Stool’ was a memorable issue for the Met as the standard vocabulary term used matched ‘stool’ as an item of furniture to 'stool' the biological term, both of which are relevant in the Met's collection, but never in the same artwork.)
- Enable inferences to be made (if an artwork is from Paris, France, we can infer French and European).
- Assign priorities to terms. For example, for the European Paintings department, when the term ‘Copenhagen’ is found, Copenhagen in Denmark is prioritized over Copenhagen in Louisiana in the absence of other disambiguating data.
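The inference capability in the list above can be illustrated by walking up a place hierarchy. This is a minimal sketch using an invented child-to-parent table rather than real TGN records, and it infers broader places ("France", "Europe") rather than the adjectival forms ("French", "European") the site presents; the function name is hypothetical.

```python
# Toy place hierarchy (child -> parent). Illustrative only, not real TGN data.
PARENT = {
    "Paris": "France",
    "France": "Europe",
    "Copenhagen": "Denmark",
    "Denmark": "Europe",
}


def inferred_terms(place):
    """Return the place followed by every broader place above it."""
    terms = [place]
    while terms[-1] in PARENT:
        terms.append(PARENT[terms[-1]])
    return terms


print(inferred_terms("Paris"))  # ['Paris', 'France', 'Europe']
```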
For the ‘who’ facet we are using an internal Museum consolidated list of artists’ names; for ‘what’ and ‘where’ we are using the Getty Vocabularies, Art and Architecture Thesaurus (AAT) and the Thesaurus of Geographic Names (TGN) respectively.
The collections data is processed in three stages:
Figure 7. The three stages of data processing
- Raw term filtering stage: This stage applies rules to perform basic cleanup on the raw data from the collection object. Museum staff can change the raw field values found on collection objects before any processing is performed on them. This allows them, for example, to correct spelling mistakes and to filter out any undesired terms.
  - Each rule can be defined on a “per department” basis, or the rule can be applied to data from any department. If a valid rule for a given term is defined both for all departments and for a specific department, the most specific rule is applied.
  - Each rule can be applied either to a specific collection object field or to all fields on a collection object. If valid rules are defined for all fields and for a specific field, the most specific rule is applied.
- Source term extraction stage: The algorithms for extracting meaningful source terms from the raw terms output by the previous stage are performed in three steps.
  - Split. The raw term string for any given field is split into sub-strings according to a list of recognized characters and regular expressions. For example:
    - Special characters and words such as commas, full stops, slashes, “and” and “or”.
    - Regular expressions that look for certain recognized patterns such as “a) [term1]. b) [term2]”.
  - Match. The sub-strings extracted from the split step are then matched against another list of regular expressions that look for known patterns such as:
    - [Term1] ([Term2])
    - ([Term1]) [Term2]
    - [Term1] and [Term2]
  - Apply Rules. The mapping rules are applied to transform extracted source terms or remove them from further processing.
- Source term matching stage: These algorithms take the list of source terms extracted in the previous stage and attempt to match them to the known subjects in the AAT, TGN or consolidated list of artist names. In the case of the “What” and “Where” facets, if multiple matches are found these algorithms also attempt to determine which match is the correct one.
  - Apply term mapping rules. If there is a mapping to say that a given source term is to be mapped to a specific vocabulary subject, then apply that mapping.
  - Determine subject mapping. Each source term is matched in turn to the terms within the relevant vocabulary and a list of matching subjects is extracted. The following steps apply:
    - Each list is sorted according to the depth of the hierarchy position of the term. This means that if there really is no other way to determine a correct subject match, the subject with the shallowest position in the hierarchy is used.
    - Each list is sorted by priority if a priority rule exists for the source term being matched. Priority rules can therefore be used to determine which subject from the list is to be used if no other qualifying information is found.
    - Each list is matched against the others to see if any subjects found qualify any of the others. For example, if the two source terms are “Paris” and “France”, then “Paris” will return multiple subjects from TGN (there are many towns and cities in the world called Paris). The data processor will, however, compare all subjects found for the term “France” against them to see if any subjects are parents or children in the hierarchy. Thus “France” in this case will qualify “Paris” as “Paris, France”, and the correct subject is selected accordingly.
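The split, match and apply-rules steps of the source term extraction stage can be sketched as follows. The separator list, the regular expressions and the two example rules are assumptions made for illustration; the production data processor uses its own configurable lists.

```python
import re

# Assumed separators: commas, full stops, slashes, and the words "and"/"or".
SEPARATORS = re.compile(r"[,./]|\s+(?:and|or)\s+", re.IGNORECASE)
# Assumed marker pattern for strings such as "a) term1. b) term2".
LETTERED_MARKER = re.compile(r"\b[a-z]\)\s*", re.IGNORECASE)
# Hypothetical mapping rules: None removes a term, a string replaces it.
RULES = {"non-organic matter": None, "Rembrant": "Rembrandt"}


def split_raw(raw):
    """Split step: break a raw field value into candidate sub-strings."""
    without_markers = LETTERED_MARKER.sub(" ", raw)
    return [p.strip() for p in SEPARATORS.split(without_markers) if p.strip()]


def match_terms(sub_string):
    """Match step: separate 'Term1 (Term2)' and '(Term1) Term2' into terms."""
    return [t.strip() for t in re.split(r"[()]", sub_string) if t.strip()]


def apply_rules(terms):
    """Apply-rules step: transform terms or drop them from processing."""
    kept = []
    for term in terms:
        mapped = RULES.get(term, term)
        if mapped is not None:
            kept.append(mapped)
    return kept


def extract_source_terms(raw):
    """Run the three steps in order on one raw field value."""
    terms = []
    for sub_string in split_raw(raw):
        terms.extend(match_terms(sub_string))
    return apply_rules(terms)


print(extract_source_terms("Paris (France), Rembrant and non-organic matter"))
```

In this sketch the "[Term1] and [Term2]" pattern is handled by the split step, since "and" is treated as a separator.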
In order that the Museum staff can set the necessary data processing rules, we built an administration interface for the data processor and an object viewer that allows the user to test run collection objects through the data processor to see how the current rule set will index the object. The results of the object viewer test run are not stored or indexed but show, given the current rule set, the index terms for each facet that will be indexed on the next run.
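To make the subject-selection logic of the source term matching stage concrete (depth ordering, priority rules, and qualification by related terms), here is a minimal sketch. The vocabulary, identifiers and the `priority` argument are invented for illustration and do not reflect real TGN records or the data processor's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    subject_id: int
    name: str
    path: tuple  # ids of ancestors, from the root down to the parent


# Toy vocabulary with invented ids.
EUROPE = Subject(1, "Europe", ())
FRANCE = Subject(2, "France", (1,))
PARIS_FR = Subject(3, "Paris", (1, 2))
N_AMERICA = Subject(4, "North America", ())
USA = Subject(5, "United States", (4,))
TEXAS = Subject(6, "Texas", (4, 5))
PARIS_TX = Subject(7, "Paris", (4, 5, 6))
VOCAB = [EUROPE, FRANCE, PARIS_FR, N_AMERICA, USA, TEXAS, PARIS_TX]


def candidates(term, priority=None):
    """List subjects named `term`, shallowest first, then by priority rule."""
    found = [s for s in VOCAB if s.name == term]
    found.sort(key=lambda s: len(s.path))
    if priority:
        # Stable sort: prioritized ids move to the front, depth order is kept.
        found.sort(key=lambda s: s.subject_id not in priority)
    return found


def related(a, b):
    """True if one subject is an ancestor of the other in the hierarchy."""
    return a.subject_id in b.path or b.subject_id in a.path


def resolve(terms, priority=None):
    """Pick one subject per source term, letting other terms qualify it."""
    lists = {t: candidates(t, priority) for t in terms}
    resolved = {}
    for term, subjects in lists.items():
        others = [s for t, lst in lists.items() if t != term for s in lst]
        qualified = [s for s in subjects if any(related(s, o) for o in others)]
        chosen = qualified or subjects  # fall back to depth/priority order
        resolved[term] = chosen[0] if chosen else None
    return resolved


print(resolve(["Paris", "France"])["Paris"])
```

With only "Paris" as input the shallowest candidate wins unless a priority is supplied; adding "France" or "Texas" as a second term qualifies the choice instead.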
6. The Search System
To provide the online collections search, including the facets and index terms that provide the onward journeys (as well as the other searches on the website such as events and site search) we needed a powerful search system.
As search was a key service, we did not limit our review of options to software that uses the same development platform as the content management system (i.e. .NET), though we only considered platforms that would run on the Museum's choice of operating system, Microsoft Windows. We reviewed and undertook extensive testing with various alternatives across software and hardware, and both proprietary and open source solutions, including Microsoft FAST ESP, Autonomy, Apache Solr/Lucene, and Google Mini.
The speed at which a set of search results is returned to the user is critical to the usability and success of the website. Whatever the size of the dataset returned from a specific search term, it was envisaged that a search and response cycle should take no longer than one second to complete, and drilling down through results by a facet should be equally efficient, bearing in mind the website would also need to interact with other display components to get the results out to the website visitor.
As integration to specific external data sources was required, including the collections and events databases, the ability to perform custom integration work and development was also crucial.
Solr, a freely available, open source software product that provides powerful search capabilities, addressed all the requirements, including simple, powerful administration and configuration interfaces, scalability, caching, and faceted search out of the box. Solr has a proven track record in the web search arena (as opposed to enterprise search products such as Microsoft FAST, which is directed more towards document searching across an intranet), including successful integration with many content management systems.
Solr provides all of the capabilities listed in the Museum’s search requirements definitions, and is flexible enough to adapt to changing requirements. It provides the right balance of power and flexibility for all of the functionality required in the site redevelopment. Despite being an open source product, options exist for the commercial support of Solr via various third parties so the Museum has a choice of expertise, rather than being directly tied to a single source of knowledge which some of the large proprietary systems may have required.
Solr excels in configurability, scaling, performance and query time enhancements including advanced text analysis with the ability to split and pluralize, and relevance ranking for ordering results based on criteria. In addition to the basic functions, Solr provides various advanced search features including spell-checking, search highlighting and auto-suggest based on end user input.
We have been extremely impressed with the speed of Solr after implementation: complex search queries across thousands of documents return within a matter of milliseconds, fulfilling the original requirement.
In the Museum’s online collection architecture the Solr indexing and delivery functions are split out for performance and security reasons.
Figure 8. The architecture of the online collections system
Solr pulls data from the data processor prior to indexing. Solr provides a number of text analysis components, and we use the following at index time:
- Stop Word – discards common words from a pre-defined list
- Word Delimiter – splits words into sub-words and performs optional transformations on sub-word groups
- Lower Case – puts all words in lower case
- ASCII Folding – converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
- Snowball Porter – stemming (a process for reducing inflected or derived words to their stem, base or root form) for several languages
- Whitespace Tokenizer – splits the text into tokens on whitespace
- HTML stripper – removes HTML tags
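The effect of such an analysis chain can be approximated in a few lines of Python. This is only an illustration of what the filters do to a piece of text, not Solr's implementation: the stop word list and the order of the steps are assumptions, the word-delimiter step is a crude stand-in, and stemming is omitted.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "and", "of", "the"}  # assumed, illustrative list


def strip_html(text):
    """Remove HTML tags, leaving a space so neighbouring words stay apart."""
    return re.sub(r"<[^>]+>", " ", text)


def fold_ascii(token):
    """Replace accented characters with their unaccented ASCII equivalents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def analyse(text):
    """Approximate the chain: strip HTML, tokenize, split, lower, fold, stop."""
    tokens = strip_html(text).split()  # tokenize on whitespace
    # Crude word-delimiter step: split each token on non-alphanumerics.
    tokens = [t for part in tokens for t in re.findall(r"[^\W_]+", part)]
    tokens = [fold_ascii(t.lower()) for t in tokens]
    return [t for t in tokens if t and t not in STOP_WORDS]


print(analyse("<em>The Château</em> of Saint-Rémy"))
# ['chateau', 'saint', 'remy']
```

Because the same normalisation is applied to the indexed text and to the user's query, a search for "chateau" can match a record that contains "Château".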
The same set is generally used at query time, where we also use other functionality including:
- Faceting (displaying aspects or views of a subject, in this case a collection object) with data drawn from multiple fields
- Prefix Facet Query, which matches the characters entered by user input to results in Solr, to form the basis for the auto-suggest
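A prefix facet request for auto-suggest might look like the sketch below. It uses Solr's standard faceting parameters (`facet.field`, `facet.prefix`, `facet.limit`, `facet.mincount`); the field name `suggest` is hypothetical, and lower-casing the typed text assumes the suggestion field is indexed in lower case.

```python
from urllib.parse import urlencode


def suggest_params(typed, field="suggest", limit=10):
    """Build a facet-only request returning index terms that start with `typed`."""
    return urlencode({
        "q": "*:*",
        "rows": 0,                      # no documents needed, only facet values
        "facet": "true",
        "facet.field": field,           # hypothetical suggestion field
        "facet.prefix": typed.lower(),  # assumes lower-cased indexed terms
        "facet.limit": limit,
        "facet.mincount": 1,            # skip terms that match no artworks
    })


print(suggest_params("Remb"))
```

Restricting `field` to a single facet's field gives the per-facet auto-suggest, while a combined field gives the collection-wide one.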
In order to cope with both the large quantity of data and the large audience numbers, the search system uses techniques to enhance scaling and performance including:
- Multiple cores – A core is an individual instance of Solr with a unique schema. A Solr server could have several instances all with different schemas. We have split the index into separate cores based around functional areas
- Sharded search – The site search combines the search results from multiple cores.
- Replication – We have automatic replication from an indexing machine out to two dedicated Solr delivery servers.
- Load balanced Solr delivery servers.
The search system we have designed and built in order to meet the variety of user needs is undeniably complex; however, feedback from users so far suggests the effort has been worthwhile:
“Perusing the new @metmuseum site. Loving the collections highlights … so much metadata” – Ben Fino-Radin via Twitter
“I am in love with the @metmuseum new website. Esp the artwork search. What a wonderful resource!” – Cynthia Wenslow via Twitter
The analytics for the relaunched site show that repeat visits, time on site, pages and visits are all well above the earlier site’s average, so we know that the Collections area is a favorite part of the new site for visitors and that it is doing a good job of attracting and engaging them.
There is undoubtedly effort needed to create and test the mapping rules for the data processor. However, doing so has improved the experience for users significantly, and the ability to make changes to the online collection without having to amend the underlying collections data has reassured many curators. They now know that if they see any sub-optimal results they can simply ask for the rules to be amended.
The Solr search system is performing well and returning relevant results. With enough memory and servers Solr is extremely fast and enables us to do things on the fly that in a previous generation of systems we would have pre-compiled. It enables us to build up very complicated queries with facet counts.
The downside to the extensive use of Solr is that it is another component in the system and the amount of work needed to implement, integrate and maintain it is significant. As we have used Solr so extensively it is now an essential part of the online collection (and the wider site) so if there are any problems the effect is far reaching.
Whilst very powerful, Solr can be complicated. Significant customization is required to get the right results and to optimize performance. For example, in the unoptimized form, ‘school of Rembrandt’ had the same rating as ‘Rembrandt’, and works with little data except for tombstone information (such as minor works attributed to ‘school of Rembrandt’) appeared above those with a lot of associated material (such as major works by Rembrandt). We therefore had to make a number of tweaks to the search weightings to make sure the most useful results appeared first.
In conclusion, we have found that when faced with a very large collection, automated term matching via rules and thesauri is very valuable, but human intervention is always needed to review and improve on it. If Solr is used as a store for this data, complex search interfaces that provide a wide variety of entry points and connections between works can be constructed.