Using an RDF Data Pipeline to Implement Cross-Collection Search
David Henry and Eric Brown, Missouri History Museum, USA
This paper presents an approach to transforming data from many diverse sources in support of a semantic cross-collection search application. It describes the vision and goals for a semantic cross-collection search and examines the challenges of supporting search of that kind using very diverse data sources. The paper makes the case for supporting semantic cross-collection search using semantic web technologies and standards including Resource Description Framework (RDF), SPARQL Protocol and RDF Query Language (SPARQL), and an XML mapping language. The Missouri History Museum has developed a prototype method for transforming diverse data sources into a data repository and search index that can support a semantic cross-collection search. The method presented in this paper is a data pipeline that transforms diverse data into localized RDF; then transforms the localized RDF into more generalized RDF graphs using common vocabularies; and ultimately transforms generalized RDF graphs into a Solr search index to support a semantic cross-collection search. Limitations and challenges of this approach are detailed in the paper.
Keywords: RDF, data integration, search, semantic web, Solr, SPARQL
The Missouri History Museum (MHM) launched a beta version of its cross-collection search in mid-2010. That implementation uses Solr (http://lucene.apache.org/solr/) as a search engine with data from multiple domains (objects, archives, and photo collections) indexed to Solr documents with various PHP scripts. Although some attempts were made to map data values to specific locations, dates, and subjects, most of the data is indexed as text fields. After gathering user feedback and considering what would be possible with semantic web tools, we identified a number of limitations to this approach: 1) text-based facets are often ambiguous or vague; 2) users do not have the ability to further explore the context of search results; 3) data from different domains must be “watered down” to conform to the search index; 4) data is not available as linked open data; and 5) we are missing an opportunity to contribute to the web of data through crowdsourcing.
To overcome these limitations, the prototype of our next version is built firmly on semantic web technologies: RDF, SPARQL, and triple/quad stores. The prototype strives to meet the following requirements: 1) aggregate and index data from different domains and from multiple institutions; 2) expose a repository of linked open data; 3) index by specific data types (personal and corporate entities, geocoded locations, date ranges, and common vocabularies); 4) maintain the rich context of data from multiple domains and institutions; and 5) allow users to make contextual links between items, people, locations, dates, and subjects, thereby contributing to the web of data. At the heart of this approach is a semantic data pipeline that “ingests” mixed data, converts it to RDF, transforms data to common data formats, aligns collection-specific fields to common fields, indexes to Solr, and provides both a faceted search interface and a SPARQL endpoint for linked open data.
The prototype approach presented here is not a single application; rather it is a proposed pipeline that uses many different tools that conform to semantic web standards such as RDF, RDFS, OWL, and SPARQL. Since they rely on web standards, tools may be swapped out for others that serve the same purpose without rebuilding the pipeline. For example, there are many tools for converting various data formats to RDF, called “RDFizers” (http://simile.mit.edu/wiki/RDFizers); our prototype uses D2RQ (http://www4.wiwiss.fu-berlin.de/bizer/d2rq/) for converting relational databases to RDF and marcmods2rdf (http://simile.mit.edu/repository/RDFizers/marcmods2rdf/) for converting MARC records to RDF. There are many tools for converting between various RDF formats; our prototype uses the ARC2 library. Many tools exist for storing and querying RDF; we use Sesame. Finally, there are several tools available for transforming/mapping RDF data; we use a custom-built processor for interpreting an XML mapping language proposed by Kondylakis et al. (2006).
The challenge of cross-collection search
Users can search Internet resources through a single search engine query, yet often the resources of a single cultural institution or university campus are segregated into silos, each with its own dedicated search system. The prominence of multidisciplinary research, the increase in the use of primary materials, and the desire to make new connections across disparate materials all would be advanced by the offering of single search to open up all the collections to the researcher. (Prescott and Erway, 2011)
This quote from a recent OCLC report about single search (or cross-collection search) sums up a common vision for cross-collection search. Users should be able to search across diverse collections from a single search interface and make new connections across disparate materials. Today, there are several notable examples of cross-collection search available to web users, providing both keyword and faceted searching across diverse collections. While these cross-collection searches have made great strides in providing keyword searches, they are limited in terms of providing domain-specific contexts to those diverse collections. Without the domain-specific contexts, it is less likely that users will be able to quickly find the resources they are seeking and even less likely that they will be able to make meaningful connections across disparate materials.
Data from multiple collections—even in the same institution—come from different data management systems with differing data structures, field names, and data types. Even where standards are used in a data management system, those standards tend to differ by domain. To overcome these data differences, developers of existing cross-collection search applications have been forced to map domain-specific data structures to a generalized data structure—for example “artist,” “photographer,” and “author” may all be generalized as “creator.” Oftentimes a person, place, event, or some other facet may be related to an item (search result) solely by keywords. By mapping to a more generalized data structure, we tend to lose the specific relationship between these facets and a given item. This makes searching more time consuming because the user must sift through irrelevant results, and the only connections a user can make are by keywords (as comments) or tags.
At MHM, our current cross-collection search (http://collections.mohistory.org/search/) suffers from the limitations described above. We are indexing to a Solr schema using various custom-built indexing scripts. The relationships between collected items and facets are limited to keywords or incomplete relationships to people, places, events, and topics. For example, we have many items that relate to the 1904 World’s Fair. Unfortunately, the 1904 World’s Fair is also known as “The St. Louis World’s Fair” and the “Louisiana Purchase Exposition”; all of these names refer to the same event. Since our index is based on keywords, searching for “1904 World’s Fair” may not return items indexed by “St. Louis World’s Fair” or the “Louisiana Purchase Exposition.” There are similar problems with variations in a person or organization name: for example, the former congressman from Missouri known as both “Richard Gephardt” and “Dick Gephardt.” Even item types can suffer from the same ambiguity. For example, there are 16,480 items categorized as “photo” and 12,542 items categorized as “photograph.”
Since these indexing problems can result in a cumbersome searching experience for the user, we need to limit or completely alleviate these ambiguities in our next version of cross-collection search. Beyond search and discovery, we want to provide an interface where our users can make meaningful connections between items in our collections and facets such as people, businesses, organizations, places, events, and topics. We often see comments in our current cross-collection search which represent attempts to make such connections. For example, users identify people shown in portraits including more information such as date and place of birth and/or date and place of death. Similarly, users will identify a photograph of a house linking the house to an address and/or residents of the house at given dates. Unfortunately, with our current search indexing, we can only index those comments as keywords; it is not possible to link to specific people, places, and/or events.
Given the limitations of our current search and our vision of building meaningful connections, our goals in the next version of cross-collection search include:
- Clearly defining specific relationships between collected items and entities such as person, organization, place, event, and topic – for example, the user should be able to make a distinction between “painter,” “photographer,” and “author”;
- Uniquely identifying entities used in facets – for example “1904 World’s Fair,” “St. Louis World’s Fair,” and “Louisiana Purchase Exposition” should all have the same unique identifier; similarly, if referring to the former congressman from Missouri, “Richard Gephardt” and “Dick Gephardt” should have the same identifier;
- Providing an interface and data infrastructure to allow our users to make specific relationships between items and entities.
Of course, simply meeting these data goals is not sufficient. There are a number of implementation goals that must also be met to provide a useful search tool in the current environment. These implementation goals include:
- Searching must be very fast—in other words, as close to immediate as possible;
- Data sources must be considered very diverse to accommodate not only collections within an institution but also collections from a wide range of external institutions;
- Data records should be considered “almost live”—it should be possible to update records on a daily or even hourly basis, and that update should be as automated as possible.
Leveraging semantic web technologies
The semantic web is a term coined by Tim Berners-Lee, who is credited with developing the initial building blocks of the web (HTTP and HTML). Berners-Lee envisions the future web being capable of providing meaningful connections between resources that intelligent agents (software) could use to answer more complex questions, such as, “Find all doctors in a 20-mile radius that specialize in sports injuries,” or “Find all businesses in the St. Louis area established before 1920” (Berners-Lee et al., 2001). This kind of interaction with web-based data is starting to appear in smart-phone “apps.” The difference between these smart-phone apps and what Berners-Lee envisions is that apps deliver services as a layer of data managed separately from the rest of the web, whereas Berners-Lee proposes that such meaningful connections should become inherent in the web, or part of the web standards. A further difference is that the questions that can be answered by any given app fall within predefined logical patterns, whereas a semantic web would allow users to define their own questions.
Some critics of the semantic web concept have argued that we may never achieve the kind of intelligent web that Berners-Lee proposed (Anderson and Rainie, 2010). Yet there is one semantic web tool which has grown in use and implementation—the Resource Description Framework (RDF). RDF is “a foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the Web.” (W3C, 1999) At its core, RDF relies on “statements” about web resources, such as, “The Missouri History Museum was founded on August 11, 1866.” These statements are created with a well-defined syntax with subject, predicate, and object, such as:
<http://www.mohistory.org> <http://www.example.org/terms/creation-date> “08/11/1866”
In an RDF statement, the subject and predicate are identified by uniform resource identifiers (URIs), while the object may be either a URI or a literal (such as “08/11/1866”). For our purposes, each URI would be a resolvable URL, such as http://www.mohistory.org/people/Archibald/Robert/0001.
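As a minimal sketch of the statement structure described above, a triple can be modeled in Python as a small tuple and written out in N-Triples syntax. The `Triple` type and `to_ntriples` helper are hypothetical names introduced only for illustration, and the escaping covers just backslashes and double quotes rather than the full N-Triples grammar.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # a URI
    predicate: str                # a URI
    obj: str                      # a URI, or literal text
    obj_is_literal: bool = False  # True when obj is a literal value


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples line (simplified escaping)."""
    if t.obj_is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# The founding-date statement from the text, with a literal object.
founded = Triple(
    "http://www.mohistory.org",
    "http://www.example.org/terms/creation-date",
    "08/11/1866",
    obj_is_literal=True,
)
print(to_ntriples(founded))
```

Keeping an explicit flag for literal objects makes the distinction between a URI and a literal visible in the data structure itself, which matters later when literals are mapped to entities.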
The great advantage of RDF over other data structures, such as databases or XML, is flexibility. “RDF provides a general, flexible method to decompose any knowledge into small pieces, called triples, with some rules about the semantics (meaning) of those pieces” (Tauberer, 2008). This flexibility allows us to take data from a localized source, such as a database, and decompose it into RDF triples with precise localized meaning by stipulating the namespace for that localized source. By defining vocabularies using the Web Ontology Language (OWL), we can map local vocabularies to a common vocabulary using emerging standards and tools for RDF. This approach is similar to the “hybrid ontology” data integration approach described by Cruz and Xiao (2005).
For some, the ultimate vision for the semantic web includes identifying the subjects, predicates, and objects of triples by globally unique identifiers that would have the same meaning in any query. It is clear that the semantic web is far from reaching that goal; in fact, it may never reach that goal. But there is still value in RDF without ever reaching the ideal point where one global URI represents one object, concept, or entity. The flexibility of RDF allows us to accommodate the fact that we will have multiple URIs for objects, concepts, and entities. We can leverage existing semantic web tools (RDFizers, RDF parsers, triple stores, and SPARQL) to help expose diverse data sources in a meaningful cross-collection search. What’s important for our purpose is that the same URI is used consistently within our cross-collection search, even if that URI is not a globally unique identifier. When other URIs are discovered, they can be linked to our URI by using predicates such as “owl:sameAs” or “skos:closeMatch.” And by establishing and maintaining a set of resolvable URLs to web resources that represent resources, entities, subjects, and relationships, at least within that single search index, we position ourselves to enable intelligent linking of these resources.
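The linking described above can be sketched as follows, assuming triples are held as plain Python tuples. The function `canonical_uris` and the two example.org URIs are hypothetical; the mohistory.org event URI is the sample identifier that appears later in this paper and is used here only as a placeholder. The sketch groups URIs connected by owl:sameAs and picks one local URI to stand for each group.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def canonical_uris(triples, preferred_prefix):
    """Map every URI in an owl:sameAs cluster to one preferred URI."""
    parent = {}

    def find(u):
        # Union-find lookup with path halving.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    clusters = {}
    for u in list(parent):
        clusters.setdefault(find(u), []).append(u)

    mapping = {}
    for members in clusters.values():
        # Prefer a URI from our own namespace; otherwise pick deterministically.
        local = sorted(m for m in members if m.startswith(preferred_prefix))
        chosen = local[0] if local else sorted(members)[0]
        for m in members:
            mapping[m] = chosen
    return mapping


triples = [
    ("http://collections.mohistory.org/id/event/675849", OWL_SAME_AS,
     "http://example.org/external/louisiana-purchase-exposition"),
    ("http://example.org/other/st-louis-worlds-fair", OWL_SAME_AS,
     "http://example.org/external/louisiana-purchase-exposition"),
]
mapping = canonical_uris(triples, "http://collections.mohistory.org/")
print(mapping["http://example.org/other/st-louis-worlds-fair"])
```

A weaker predicate such as skos:closeMatch would normally not be merged this way, since it does not assert identity.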
A data pipeline for cross-collection search
The data pipeline developed at MHM is composed of the following steps: 1) accept source data in mixed but accessible formats; 2) convert source data into local RDF using one or more RDFizers; 3) map local RDF to common RDF using a common mapping language; and 4) index common RDF to a Solr index using dynamic fields (see Figure 1). Each step of the pipeline will be described in turn.
Figure 1. An overview of the Missouri History Museum’s RDF pipeline
Step 1. Accept source data in mixed but accessible formats.
Ideally, we would want the capability to easily index data from any source. In reality, this is not possible. To RDFize a data source, we would need data that meets some basic criteria. The data must be sufficiently structured to be converted into triples (subjects, predicates, and objects). Unless we are starting with existing RDF data or some XML, the objects extracted from source data will be in “literal” form (usually text, integers, or dates); they will not be defined as URIs. However, these literals must be associated with subjects and predicates that may be defined as URIs. To define subjects as URIs, we would need unique records in the source data, such as rows in a table, MARC records, or nodes in an XML document. To define predicates as URIs, we would need fields in the source data, such as columns in a table, MARC properties, or tags in an XML document. A collection of text documents would not usually meet these criteria. In addition, some data sources may be in a proprietary data storage format; these would also be excluded from this indexing process. For MHM’s prototype RDF pipeline, data sources include various relational databases, MARC records, and XML files.
Step 2. Convert source data into local RDF using one or more RDFizers.
An RDFizer extracts records from source data and converts them into RDF triples. The resulting triples can be considered “local” RDF—meaning that the objects tend to be literal, and any URIs are specific to the data source. These local triples should represent the local data as closely as possible. At this point, there should be no attempt to map to a common set of predicates, link to existing subjects, or convert primitive values (most source values will tend to be either text or number).
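To make the idea of “local” RDF concrete, the following is a minimal sketch of what an RDFizer does with tabular data; it is not how D2RQ or marcmods2rdf are implemented. The function name, the table and column names, and the example.org base URI are all assumptions made for illustration. Each row becomes a subject URI, each column becomes a source-specific predicate URI, and each value stays a literal.

```python
def rdfize_rows(table, rows, key_column, base_uri):
    """Turn table rows into local (subject, predicate, literal) triples."""
    triples = []
    for row in rows:
        # One subject URI per record, built from the primary key.
        subject = f"{base_uri}{table}/{row[key_column]}"
        for column, value in row.items():
            if column == key_column or value is None:
                continue  # skip the key itself and empty cells
            # One predicate URI per column, scoped to this data source.
            predicate = f"{base_uri}vocab/{table}_{column}"
            triples.append((subject, predicate, str(value)))
    return triples


rows = [
    {"id": 17, "title": "Portrait of a family", "maker": "Unknown", "year": None},
]
local = rdfize_rows("photos", rows, "id", "http://example.org/local/")
for triple in local:
    print(triple)
```

Note that no value is interpreted or normalized here, in keeping with the requirement that local triples represent the source data as closely as possible.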
The challenge at this step is configuring the data source and RDFizer so that data updates are seamless and automated. RDFizer tools vary in terms of system requirements and interfaces. Some require certain programming libraries and/or supporting applications. Some may be run from a command line, while others may require opening a graphical user interface (GUI). The RDFizers we used to convert data sources for MHM’s cross-collection search (D2RQ, marcmods2rdf) required Java servlets and supporting Java libraries. The tools we used could be run from the command line and, therefore, can be run as scheduled jobs (cron on Unix-based systems). An added benefit of the D2RQ tool is that it can expose an existing relational database as a SPARQL endpoint so that the RDFizing is live.
In some cases, data will need to be reformatted to enable error-free RDFizing. For example, database tables or column names that include spaces and punctuation will have to be renamed without spaces or punctuation; primary keys will need to be added to tables that do not have them; characters encoded as something other than UTF-8 (for example, some text copied from Microsoft Word) must be converted to UTF-8 encoding; views should be removed from the database; and keys defined as unique with non-unique values should be removed. XML data should be well formed and have consistent encoding; in some cases, an RDFizer may require that the XML is valid against a given schema. To the extent that these problems exist in the source data, the likelihood of managing automated updates is reduced.
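Two of the clean-up tasks above, renaming identifiers and normalizing text encoding, can be sketched with the Python standard library. The helper names are hypothetical, and the choice of Windows-1252 as the fallback encoding is an assumption based on the Microsoft Word example; a real source may need a different fallback.

```python
import re


def clean_identifier(name: str) -> str:
    """Replace runs of spaces and punctuation with a single underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
    return cleaned or "unnamed"


def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return UTF-8 bytes, re-encoding from a fallback encoding if needed."""
    try:
        raw.decode("utf-8")  # already valid UTF-8: leave untouched
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


print(clean_identifier("Date of Birth (approx.)"))
# Windows-1252 "smart quotes" around a word, as pasted from a word processor.
print(to_utf8(b"\x93Fair\x94").decode("utf-8"))
```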
Step 3. Map local RDF to common RDF using a common mapping language.
To achieve the goal of discovering common connections between diverse collections, we need to develop vocabularies with predicates that: 1) may be common across all records; 2) are identified by a “dereferenceable” URI; and 3) contain data that are clearly defined and dereferenceable. Since local RDF triples may use subjects and predicates that are specific only to the data source, we need to map those local RDF triples to “common” RDF triples that use common subjects, predicates, and where possible, objects.
This step is probably the most challenging of the steps in the pipeline. Mapping from one set of triples to another is non-trivial, involving judgments about the meaning of source triples and the conversion of literal data into often complex RDF graphs. Several tools exist to help perform this kind of mapping including TopBraid Suite from TopQuadrant; PoolParty from poolparty.punkt.at; Neon Toolkit from www.neon-project.org; Protégé from protege.stanford.edu; and DERI Pipes from pipes.deri.org. For a variety of reasons, we decided to build our own processor for this step rather than incorporate one of the tools listed above. No matter what tool is used, the process of mapping data requires knowledge of the domains involved and expertise in managing metadata.
At MHM, we have imposed a number of strict requirements for the kind of RDF graphs that should result from this mapping. First, where possible, we need to convert literal objects (such as names, places, subjects, and events) into RDF graphs. For example, a place such as “705 Olive Street, St. Louis, Missouri” could be converted to the graph represented in Figure 2. We may also wish to convert a simple triple into a complex set of graphs to convert to an activity-based schema such as the CIDOC-CRM (http://www.cidoc-crm.org/) or LIDO (http://www.lido-schema.org). Second, where possible, we want to use (and reuse) URIs that are dereferenceable. This is important for meeting some of the emerging best practices for linked open data (http://linkeddatabook.com/editions/1.0/#htoc11) and makes it possible for us to build on resources over time by adding triples where there are relevant connections (for example, relating vital dates and events to a person where that person is defined by a dereferenceable URI). At MHM, we are assigning permanent URIs to entities such as people, businesses, organizations, places, events, item types, and topics. Finally, our common RDF should allow for a mix of vocabularies, including both commonly used vocabularies and vocabularies specific to a particular sub-domain. For example, we should leverage existing vocabularies such as Dublin Core (dc) and Friend of a Friend (foaf) along with our own custom vocabularies. Alignments such as these would make our data more useful in the larger linked open data community. But collection-specific vocabularies may also be added to avoid losing the context of an item. For example, just as the distinction between publisher, author, and editor is important context in a collection of published works, the distinction between furniture designer, cabinetmaker, and retailer may be important for a collection of furniture. These collection specifics can be mapped to the resulting “common” RDF.
Figure 2. An example graph for “705 Olive Street, St. Louis, Missouri”
None of the existing tools could be implemented “out of the box” to map from local RDF to the kind of common RDF we need. Therefore, for our initial prototype we chose to develop our own mapping tool. The process of developing our own mapping tool for the prototype pipeline has helped us develop more precise requirements that can be used when we begin to evaluate some of the existing tools. Our mapping tool makes use of a mapping language presented in recent research related to semantic data integration (Kondylakis et al., 2006; Lourdi and Papatheodorou, 2008). This mapping language is flexible enough to meet our needs for transforming literals into graphs, and it can represent the kind of simple to complex conversions that we require.
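The kind of literal-to-graph conversion discussed in this step can be sketched in a few lines, under heavy simplification. This is not the XML mapping language of Kondylakis et al., nor our processor; the rule format, the `map_to_common` function, the entity index, and every example.org URI are hypothetical. A rule names the common predicate and the entity type for a local predicate, and a lookup table resolves a literal to a shared entity URI.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def map_to_common(local_triples, rules, entity_index):
    """Rewrite local literal triples as links to shared entity URIs."""
    common = []
    for s, p, o in local_triples:
        rule = rules.get(p)
        if rule is None:
            continue  # unmapped predicates are simply skipped in this sketch
        key = (rule["entity_type"], o.strip().lower())
        entity_uri = entity_index.get(key)
        if entity_uri is None:
            # No known entity: keep the literal under the common predicate.
            common.append((s, rule["predicate"], o))
            continue
        common.append((s, rule["predicate"], entity_uri))
        common.append((entity_uri, RDFS_LABEL, o))
    return common


rules = {
    "http://example.org/local/vocab/photos_location": {
        "predicate": "http://example.org/vocab/depictsPlace",
        "entity_type": "place",
    },
}
entity_index = {
    ("place", "705 olive street, st. louis, missouri"): "http://example.org/id/place/1",
}
local_triples = [
    ("http://example.org/local/photos/17",
     "http://example.org/local/vocab/photos_location",
     "705 Olive Street, St. Louis, Missouri"),
]
common = map_to_common(local_triples, rules, entity_index)
for triple in common:
    print(triple)
```

A fuller place graph would attach further triples (street, city, state, coordinates) to the place URI; only the label is added here.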
The common RDF statements or triples (subject, predicate, and object) that result from the mapping described above should be saved, together with their context, to a reliable quad store. A triple statement plus its context is often referred to as a quad (or named graph). The context associated with a statement will allow us to make judgments about the quality of the assertion. For example, an assertion made by a museum curator would have greater validity than an assertion made by an anonymous user. There are several quad stores available including AllegroGraph, Virtuoso, Jena, Mulgara, and Sesame. At MHM we use the Sesame quad store. Ideally all communication or data integration with the quad store would be standard (for example, using SPARQL). However, the context portion of the statement is relatively new, so it is not yet part of the SPARQL standard. Therefore, to abstract the writing of data to the quad store, we have developed a prototype data service to make any calls to the quad store.
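How context can inform judgments about an assertion is sketched below with an in-memory list standing in for the quad store. The `Quad` type, the `preferred_values` function, the trust scores, and all URIs and names are hypothetical; Sesame's own API is not shown.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # URI naming the source of the assertion


def preferred_values(quads, subject, predicate, trust):
    """Return the objects asserted for subject/predicate, most trusted first."""
    matches = [q for q in quads if q.subject == subject and q.predicate == predicate]
    # Unknown contexts get a score of 0 and therefore sort last.
    matches.sort(key=lambda q: trust.get(q.context, 0), reverse=True)
    return [q.obj for q in matches]


CURATOR = "http://example.org/context/curator"
ANONYMOUS = "http://example.org/context/anonymous"
PHOTO = "http://example.org/id/item/17"
SITTER = "http://example.org/vocab/sitterName"

quads = [
    Quad(PHOTO, SITTER, "J. Smith", ANONYMOUS),
    Quad(PHOTO, SITTER, "John A. Smith", CURATOR),
]
ranked = preferred_values(quads, PHOTO, SITTER, {CURATOR: 2, ANONYMOUS: 1})
print(ranked)
```

Both assertions are retained; the context only determines which one is presented first.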
Step 4. Index common RDF to a Solr index using dynamic fields.
This part of the process relies heavily on the Apache Solr search engine (http://lucene.apache.org/solr/). Solr is a very fast search index that supports advanced searching features such as facets, spell suggest, and ranking, as well as established text searching. When properly configured, search results are normally returned in fractions of a second—even when the index contains hundreds of thousands of records. For cross-collection search, Solr’s speed is a great advantage over federated searching or most database-backed searches. Solr indexes are configured by building a schema where field types, fields, and facets are defined. Solr’s indexing schema is very flexible, allowing for the creation of any custom search fields and facets. This flexibility has made it a good choice for indexing diverse data sources.
At MHM, we are already using Solr in our current cross-collection search; however, the current indexing process is less than ideal because it relies primarily on text-based data (or literals), thereby making no connection between logical subjects, places, types, or entities. By indexing to Solr from the common RDF that results from the mapping described above, we can overcome much of the ambiguity resulting from a text-based index. The process of indexing Solr from our common RDF relies on using facets that are linked to entities (assigned to URIs) and using dynamic fields in the Solr schema.
The prototype search index for cross-collection searching at the MHM has five broad level facets:
- Who: linked to person-org entities (for example, http://collections.mohistory.org/id/person-org/912837)
- What: linked to our types vocabulary (for example, http://collections.mohistory.org/vocab/types/dress)
- When: linked to events entities and dates (for example, http://collections.mohistory.org/id/event/675849)
- Where: linked to place entities (for example, http://collections.mohistory.org/id/place/013498)
- Why: linked to our topic vocabulary (for example, http://collections.mohistory.org/id/topic/726354)
A cross-collection search based solely on these broad facets may be of interest to some casual users who may be asking questions such as, “Are there any resources related to my house?”, “What resources are related to my grandfather?”, or “Are there any resources related to the Camp Jackson Affair?” However, such broad facets would be unlikely to satisfy a more serious researcher who may be asking a question such as, “Who were the photographers responsible for photos depicting the Camp Jackson Affair?” Of course, Solr search allows one to combine facets, so even with these limited broad facets it would be possible to combine facets to narrow search results: for example, limit the when facet to the “Camp Jackson Affair” and limit the what facet to photograph. But that kind of narrowing would likely be cumbersome for a serious researcher. A serious researcher would expect more context through relationships such as “photographedBy” and “depicting.” As described above, this kind of context can be captured by the common RDF created in step 3 of the pipeline. This level of context can be indexed to Solr by making use of dynamic fields.
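The facet combination described above can be sketched as a Solr select request with one filter query per broad facet. The `q`, `fq`, `wt`, `facet`, and `facet.field` parameters are standard Solr request parameters; the base URL, the function name, and the `types/photograph` URI (modeled on the `types/dress` example) are assumptions, and the event URI is a sample identifier standing in for the Camp Jackson Affair.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def solr_select_url(base, filters, facet_fields=()):
    """Build a Solr select URL with one fq filter per (field, URI) pair."""
    params = [("q", "*:*"), ("wt", "json")]
    for field, uri in filters.items():
        # Quote the URI so its colons and slashes are not parsed as syntax.
        escaped = uri.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", f) for f in facet_fields)
    return f"{base}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8983/solr/collections",  # hypothetical Solr core
    {
        "when": "http://collections.mohistory.org/id/event/675849",
        "what": "http://collections.mohistory.org/vocab/types/photograph",
    },
    facet_fields=["who"],
)
parsed = parse_qs(urlsplit(url).query)
print(parsed["fq"])
```

Requesting facet counts on the who field would list every person or organization linked to the matching photographs, but without saying in what role, which is the gap that the context-specific fields are meant to close.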
Dynamic fields in Solr allow one to define a class of fields or a kind of wildcard field such as *_who. When dynamic fields are defined in the Solr schema, it is not necessary to define all of the fields that can be indexed. For example, if the dynamic field *_who is defined in the schema, it is possible to index fields such as “photographedBy_who” and “manufacturedBy_who.” By adding a copyField statement instructing Solr to copy the contents of the dynamic field to a broad facet, such as the who facet, it is possible to search by a more specific context within that broad facet.
MHM makes use of dynamic fields to capture specific context by the following steps:
1) In the Solr schema, we created broad facets using fields: who, what, when, where, why. When indexing, only URIs are indexed to the broad facet fields. These fields are of type “string” so they are not tokenized and are searchable by the full URI.
2) We created the following dynamic fields: *_who,*_what,*_when,*_where, and *_why. These fields are also intended to only accept URIs, so they are also of type string. Context-specific predicates are indexed to these dynamic fields and copied to the related broad facet depending on type. For example, mhm:/vocab/photographedBy would be indexed to “photographedBy_who” in Solr and copied to the who field in Solr.
3) Additional dynamic fields were created to handle literal values so they can be searched directly without opening the RDF resource to get the label: *_who_label,*_what_label,*_when_daterange,*_where_geohash,*_why_label. The label fields are of type “text” to allow for tokenizing and keyword searching. The daterange field is of type “date,” and the geohash field is of type “string.” When indexing the fields, we must perform a secondary query on URIs to get indexable values such as rdfs:label or skos:altLabel.
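The three steps above can be sketched as a function that assembles one Solr document from the entity links of an item. This is an illustrative sketch rather than our indexing script: the function name, the input shape, and the label value are assumptions. Copying into the broad facet is done here in Python only to show the resulting document; with a copyField rule in the schema, Solr would do that copying itself and the explicit broad field should be omitted to avoid duplicates. Date range and geohash fields are left out for brevity.

```python
FACETS = ("who", "what", "when", "where", "why")
LABEL_FACETS = ("who", "what", "why")  # facets that have a *_label field


def build_solr_doc(item_uri, links, labels):
    """Build a Solr document; links are (predicate_name, facet, entity_uri)."""
    doc = {"id": item_uri}
    for name, facet, uri in links:
        if facet not in FACETS:
            raise ValueError(f"unknown facet: {facet}")
        # Context-specific dynamic field, e.g. photographedBy_who.
        doc.setdefault(f"{name}_{facet}", []).append(uri)
        # Broad facet field (what a copyField rule would populate).
        broad = doc.setdefault(facet, [])
        if uri not in broad:
            broad.append(uri)
        # Searchable label fetched by a secondary lookup on the URI.
        label = labels.get(uri)
        if label and facet in LABEL_FACETS:
            doc.setdefault(f"{name}_{facet}_label", []).append(label)
    return doc


PERSON = "http://collections.mohistory.org/id/person-org/912837"
doc = build_solr_doc(
    "http://example.org/id/item/17",
    [("photographedBy", "who", PERSON)],
    {PERSON: "Example Photographer"},
)
print(doc)
```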
The data pipeline described above makes it possible to convert heterogeneous data from various source collections into common RDF that provides linked open data and finally provides a search index that allows searching at both a broad level and at a level specific to the context of a source collection. It might be possible to set up such a pipeline for a one-time snapshot of existing data. But to provide ongoing “live” access to data in the foreseeable future with opportunities to continuously build on the search index, there are some practical issues to consider.
First, maintaining and building upon the pipeline would require sufficient in-house expertise, both in the technology and in metadata. The technological tools that MHM uses for its pipeline require knowledge of server-side scripting, command-line tools, Java servlet engines, and data encoding. The data-mapping process requires staff with expertise in developing ontologies and managing metadata, and an understanding of various data formats. At MHM, building the prototype data pipeline required close collaboration between Technology Department staff and staff from various museum and library departments. As we move into full implementation, we have plans to dedicate staff to the data-mapping process: at least one metadata expert and one systems/data specialist.
Second, updates to the common RDF repository must be done with care. An RDF triple store (such as Sesame) does not have the kind of data validation typically found in a relational database management system, including validating data types and uniqueness. The triple store will accept any type of object value in the “subject, predicate, object” triple. There is no way to limit object values to a certain class of URIs or a certain type of literal, or even to control whether the value is a literal or a URI. Similarly, there is no way in existing triple stores to require that a predicate have only one value for a given subject. Since there is no built-in validation in the triple store, it is up to those responsible for the mapping process and scripting to validate values written to the triple store. Updates to RDF should make use of a context URI. For example, when data is mapped to a common RDF repository from a specific local RDF repository, it should include a context URI that is specific to that local repository. To update the triples from one local repository without deleting all of the triples in the common repository, first the existing triples have to be removed by reference to the context; then the new triples can be written under the same context.
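The update procedure described above can be sketched with a Python set of 4-tuples standing in for the quad store. The function name and URIs are hypothetical, the URI check is deliberately crude, and Sesame's own API is not shown. Validation runs before anything is removed, so a bad batch leaves the existing triples in place.

```python
def replace_context(store, context, new_triples):
    """Return a new store with all quads in `context` replaced."""
    # Validate first: the triple store itself will not reject bad values.
    for s, p, o in new_triples:
        if not (s.startswith("http") and p.startswith("http")):
            raise ValueError(f"subject and predicate must be URIs: {(s, p, o)}")
    # Remove existing quads by reference to the context...
    kept = {q for q in store if q[3] != context}
    # ...then write the new triples under the same context.
    return kept | {(s, p, o, context) for s, p, o in new_triples}


PHOTOS = "http://example.org/context/photos-db"
ARCHIVES = "http://example.org/context/archives-db"
TITLE = "http://example.org/vocab/title"

store = {
    ("http://example.org/id/item/17", TITLE, "Old title", PHOTOS),
    ("http://example.org/id/item/90", TITLE, "Letter, 1861", ARCHIVES),
}
store = replace_context(
    store, PHOTOS, [("http://example.org/id/item/17", TITLE, "New title")]
)
for quad in sorted(store):
    print(quad)
```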
Finally, to allow for flexibility and changes in technology, the different processes that make up the pipeline should be only loosely coupled to the other processes. A process that relies on reading triples from an RDF store can be loosely coupled when access is through a conformant SPARQL endpoint (http://www.w3.org/TR/rdf-sparql-protocol/). More and more datasets are being published as SPARQL endpoints on the web, and there is no indication that the growth in SPARQL endpoints is leveling off (Cyganiak, 2011). That continued growth means that we can expect more data to be available through SPARQL endpoints in the next few years, and an increasing number of tools should become available for managing SPARQL endpoints. In the pipeline presented above, there are two points where data is read from SPARQL endpoints: 1) when mapping from a local RDF store to common RDF; and 2) when indexing from common RDF to Solr. The underlying application that delivers the SPARQL endpoint (in this case Sesame) is not necessary for the long-term maintenance of the pipeline—it could be replaced with another application that also delivers a conformant SPARQL endpoint.
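Loose coupling through a SPARQL endpoint can be sketched as follows: a pipeline process only needs to know the endpoint URL and the SPARQL Protocol, in which the query is passed in the `query` parameter of an HTTP request. The endpoint URL and function name are assumptions, and the request is built but not sent, so the sketch runs without a server.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
req = sparql_request("http://localhost:8080/sparql", QUERY)  # hypothetical endpoint
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would work against any conformant endpoint, which is why the application behind the endpoint can be replaced without changing the processes that read from it.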
The prototype RDF pipeline presented here meets some of the goals outlined earlier in this paper, but not all. Two challenges remain: incorporating new, diverse data sources is still difficult, and the prototype does not yet provide an interface that allows users to make meaningful connections between resources and entities in the index.
In the pipeline presented here, mapping local RDF data to common RDF is a difficult manual process that requires a lot of knowledge of the source data, the target data, and parsing functions. The data map is manually written to an XML mapping file. To streamline the process of mapping data, we need an application that allows more automated references to source data, target data, and parsing functions; for example, by using auto-suggest or visual representations. While we expect that the mapping process will always require a certain amount of expertise in managing metadata and an understanding of the specific vocabularies being used, there should be opportunities to make the process more efficient. It may be possible to configure one or more of the mapping tools listed under step 4 above to provide this functionality. Further evaluation is needed to determine which, if any, of these applications would allow for an efficient mapping process that also meets our strict requirements for mapping.
The biggest challenge in providing an interface that allows users to make meaningful connections between resources is to incorporate that capability in the search interface without overwhelming the user with too much information and too much complexity. To overcome this challenge, we need to rely on the progressive disclosure strategy—showing only what is necessary for a given level of involvement. The user who is only interested in searching and seeing search results should not be presented with a lot of information and choices about creating linkages between resources. Instead, we may use an icon similar to the ShareThis (http://sharethis.com/) icon to allow users to make such connections. The interfaces related specifically to making linkages should be easy enough to follow so that the user does not need any understanding of the underlying vocabularies and RDF. Using auto-suggest and other search tools should help simplify these interfaces.
There are significant challenges to implementing a system that provides search across diverse collections in a way that is useful for both the general public and serious researchers. This paper has presented an approach to delivering a semantic cross-collection search by leveraging emerging semantic web technologies, specifically those related to the Resource Description Framework (RDF). With these technologies, it is possible to create a flexible infrastructure that can adapt to new technologies as needed.
To build upon the approach presented here, future work should focus on developing user-friendly interfaces that can make the best use of the semantic cross-collection search index, and simplifying the tools used in the pipeline to accommodate institutions of all sizes and types.
Anderson, J., & L. Rainie. (2010). The Fate of the Semantic Web. Consulted January 25, 2012. Available at http://pewinternet.org/~/media//Files/Reports/2010/PIP-Future-of-the-Internet-Semantic-web.pdf
Berners-Lee, T., et al. (2001). “The Semantic Web.” Scientific American (May). Consulted January 25, 2012. Available at http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf
Cruz, I., & H. Xiao. (2005). The Role of Ontologies in Data Integration. Consulted January 24, 2012. Available at: http://www.cs.uic.edu/~advis/publications/dataint/eis05j.pdf
Kondylakis, H., et al. (2006). Mapping Language for Information Integration. Consulted January 24, 2012. Available at: http://www.cidoc-crm.org/docs/Mapping_TR385_December06.pdf
Lourdi, I., & C. Papatheodorou. (2008). Semantic Integration of Collection-level Information: A Crosswalk Between CIDOC/CRM & Dublin Core Collections Application Profile. Consulted January 24, 2012. Available at: http://www.ionio.gr/~papatheodor/papers/cidoc2008.pdf
Prescott, L., & R. Erway. (2011). Single Search: The Quest for the Holy Grail. Dublin, Ohio: OCLC Research. Consulted January 24, 2012. Available at: http://www.oclc.org/research/publications/library/2011/2011-17.pdf
Tauberer, J. (2008). What is RDF and what is it good for? Last revised January 2008. Consulted January 24, 2012. Available at: http://www.rdfabout.com/intro/?
W3C (World Wide Web Consortium). (1999). “Resource Description Framework (RDF) Model and Syntax Specification.” O. Lassila and R. Swick (eds.). Available at: http://www.w3.org/TR/PR-rdf-syntax/