1. Introduction
We report on the LODAC (Linked Open Data for ACademia) project at the National Institute of Informatics, which is part of a project at the Transdisciplinary Research Integration Center (http://www.rois.ac.jp/tric/) in the Research Organization of Information and Systems. Our purpose is to build an information distribution system that can share and publish a wide range of data as linked open data, especially for use as an academic resource for research communities and for society in Japan.
In this paper, we introduce the prototype system (LODAC-Museum), which is designed to aggregate information across multiple resources. We identify and associate artists and artworks from different museum collections to provide integrated views of them. Each museum in Japan has built its own database system and a digitized collection of museum items. These databases play an important role in national intelligent infrastructure programs and in the e-Japan 2002 priority policy program initiative of the central government. Many of the museums were influenced by the Japan Digital Archives Association (dissolved in 2005). However, little effort was made in the museum metadata field to adopt a standard data model for collections.
Since Japanese museums have so far developed unique collection systems, each with its own metadata schema, it is difficult to retrieve relevant information by searching multiple museum databases. To solve this problem, the Agency for Cultural Affairs has directed the development of a common index search system called "Cultural Heritage Online". However, it does not solve all the problems, since many collections in the arts and culture fields still remain to be digitized and integrated across over 5,773 museums in Japan. Meanwhile, several documentation models have appeared in these fields, notably the CIDOC CRM (ISO 21127:2006) (Doerr et al., 2006), which is intended to support cultural heritage documentation. It became an ISO standard in 2006 and provides an extensible semantic framework that can be used to describe any cultural heritage information (Binding et al., 2008).
In Japan, Tokyo National Museum published the "Structured Model for Museum Object Information (http://webarchives.tnm.jp/docs/informatics/smmoi)", which is based on the concepts of the CIDOC CRM. This model is intended to support cultural heritage and museum information management. However, these models are seldom applied in museums in Japan. They are difficult for curators and museum staff to use, since their structure does not fit each museum. Additionally, many small museums still cannot build databases, or have yet to complete even a collection list. This means that there are still undiscovered cultural information resources. We need to organize, publish, and share our cultural information resources comprehensively.
2. Open cultural information resources
Recently, a growing number of organizations have been building crossover search systems that search multiple museum databases at once. For example, "Shimane Prefecture Virtual Museum (http://www.v-museum.pref.shimane.jp/)" and "Union Catalog of the Collections of the National Art Museums (http://search.artmuseums.go.jp/)" each search several museum collections at once. However, their system structure is not suited to letting other systems refer to the data, so the data cannot be used as linked data. For example, suppose material-A is held by a museum in Tokyo and material-B, which is part of material-A, is held by a museum in Nara. If we want to look at the data for each, we have to repeat the search and interpret the search results ourselves. We suggest that different types of resources be linked together on the Web, so that they can be integrated as one piece of data when needed. We need to construct a platform for resource usage that forms an open, circulating system to be used by the community.
A traditional information distribution system consists of the following distinct processes:
- Use and Create (Uses existing information and artwork for individuals).
- Publish (Published with HTML or image, all in media format).
In contrast, we expect the following five processes to form the information cycle:
- Use (Use for group activity, specialized techniques, professional point of view).
- Publish (Published with Linked Open Data policy).
- Collect (Collect metadata, resource information, thesauri).
- Share (Shared relevant data, common metadata and linked similarity information).
- Create (Create a new frontier or artwork with an integration approach on the Web).
Fig 1: Platform of Information cycle
We need a new information platform to find new knowledge or create a new field of study in the culture and arts sector.
3. A LODAC-Museum Approach
The Purpose of LODAC-Museum
We are attempting to solve the following culture and arts field problems by constructing a LODAC-Museum with Semantic Web and Linked Open Data technology (Takeda, 2008; 2010).
- The museums in Japan maintain and publish museum information with their own metadata schemata. This makes crossover searching of museum information difficult. That is to say, information retrieval returns only fragments of information that are not consolidated. Thus, we need to integrate information from several sources by using Linked Open Data. This is the goal of the prototype system.
- We suppose that new knowledge and new search methods can be discovered not only by using museum collection data but also by using other sources, for example, library catalog data, thesauri, specialized terminology, and GIS data.
- There is a need to improve the fluidity and flexibility of information distribution in Japan. The created Linked Open Data system links information available on diverse sites on the Web. It can also be a platform for giving comments back to each data provider.
Data sources
The prototype system uses existing museum collection data, thesauri, and other types of information. Table 1 shows the data sets which are included in the implemented prototype system. These data were collected and scraped from each museum Web site except the Thesaurus of Japanese Art.
The Thesaurus of Japanese Art, built by a research group at the University of Tsukuba (Fukuda & Omuka, 1997), contains Japanese art information such as artwork, creator, work title, era, owner, and so on. We use the thesaurus as the basic information for integrating other sources in Linked Open Data format.
Source | Resource Type | Number of data items |
---|---|---|
National Art Museum | Artwork | 25180 |
The National Museum of Western Art | Artwork | 4373 |
Kyoto National Museum | Artwork | 5819 |
Nara National Museum | Artwork | 431 |
Fukushima Pref Art Museum | Artwork | 20 |
Tochigi Pref Art Museum | Artwork | 32 |
Akita Pref Modern Art Museum | Artwork | 22 |
Iwate Pref Art Museum | Artwork | 1588 |
Tokushima Pref Modern Art Museum | Artwork | 18482 |
Yamanashi Pref Art Museum | Artwork | 262 |
Kagawa Pref Higashiyama Kaii Setouchi Art Museum | Artwork | 5416 |
Thesaurus of Japanese Art | Artwork | 266 |
Thesaurus of Japanese Art | Person | 3800 |
Thesaurus of Japanese Art | Group | 1332 |
Thesaurus of Japanese Art | Museum Information | 289 |
Cultural Heritage Online | Museum Information | 648 |
Database for National Treasure & Important Cultural Property of National Designated | Artwork | 915 |
DBPedia Japan | (Referred to DBPedia Japan) | - |
GIS data National and Regional Planning Bureau | GIS data (Currently pending) | - |
TOTAL | | 103096 |
Table 1: Information resources
Data integration
Library data is organized as bibliographic entity and authority information that describes authors. For example, the National Diet Library (Japan) separates and categorizes these types of information to describe each relationship (Nagamori & Sugimoto, 2006).
In contrast, museum data has no established authority information, so it is difficult to integrate information collected from different museums. LODAC-Museum generates datasets that are scraped from individual museum websites and then transformed into Linked Open Data format. We can neither modify the data sources nor change their contents, since authority over the data belongs to the individual data sources. LODAC-Museum therefore manages two different types of resources. One type is called a "Ref-Resource" (code following).
<http://lod.ac/ref/18731>
<http://lod.ac/ns/lodac#exhibitionHistory> "個展(東京、東京画廊1969)";
<http://lod.ac/ns/lodac#genre> "Prints:", "版画:";
<http://purl.org/NET/cidoc-crm/core#P62I_is_depicted_by> "右下に署名(刷)";
<http://purl.org/dc/elements/1.1/creator> "横尾忠則";
<http://purl.org/dc/terms/created> "1969", "昭和44";
<http://purl.org/dc/terms/extent> "90.0×90.0", "on paper, acrylic films and acrylic sheet90.0×90.0";
<http://purl.org/dc/terms/identifier> "P01847";
<http://purl.org/dc/terms/isReferencedBy> <http://lod.ac/id/18731>;
<http://purl.org/dc/terms/medium> "silkscreen", "シルクスクリーン・紙、アクリルフィルム、アクリル板・1";
<http://purl.org/dc/terms/provenance> "平成17年度購入P01847";
<http://purl.org/dc/terms/source> <http://search.artmuseums.go.jp>;
<http://purl.org/dc/terms/title> "Landscape No.1 Girl","風景 No.1 女の子";
a <http://lod.ac/ns/lodac#WorkReference>;
<http://www.w3.org/2004/02/skos/core#prefLabel> "Landscape No.1 Girl", "風景 No.1 女の子" .
A Ref-Resource is scraped from a museum Web site and identified with a unique number provided by the original information source. The other type is called an "ID-Resource" (code following).
<http://lod.ac/id/18731>
<http://purl.org/NET/cidoc-crm/core#P55_has_current_location> <http://lod.ac/id/912>;
<http://purl.org/dc/terms/creator> <http://lod.ac/id/874>;
<http://purl.org/dc/terms/references> <http://lod.ac/ref/18731>;
<http://purl.org/dc/terms/title> "Landscape No.1 Girl","風景 No.1 女の子";
a <http://lod.ac/ns/lodac#Work>;
<http://www.w3.org/2004/02/skos/core#prefLabel> "Landscape No.1 Girl",
"風景 No.1 女の子" .
An ID-Resource is defined in a similar way to a Ref-Resource; however, it is given an original LODAC identifier, which serves to describe the relations between Ref-Resources. We provide Ref-Resources and ID-Resources for both artworks and creators. As a result, we are responsible only for the ID-Resources.
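The two-way link between the two resource types can be sketched in a few lines of Python. This is only an illustration of the structure shown in the listings above, not the project's implementation: the class names, the `link` helper, and the use of plain dictionaries instead of an RDF store are all assumptions. The only details taken from the listings are the URI patterns and the `dc:references` / `dc:isReferencedBy` pair.

```python
from dataclasses import dataclass, field

LOD_BASE = "http://lod.ac"  # base URI as it appears in the listings above


@dataclass
class RefResource:
    """Metadata scraped from one source; its content belongs to that source."""
    number: int
    source: str
    properties: dict = field(default_factory=dict)

    @property
    def uri(self) -> str:
        return f"{LOD_BASE}/ref/{self.number}"


@dataclass
class IdResource:
    """Hub resource maintained by the integrator (hypothetical class name)."""
    number: int
    references: list = field(default_factory=list)

    @property
    def uri(self) -> str:
        return f"{LOD_BASE}/id/{self.number}"


def link(id_res: IdResource, ref_res: RefResource) -> None:
    """Record both directions: dc:references and dc:isReferencedBy."""
    if ref_res.uri not in id_res.references:
        id_res.references.append(ref_res.uri)
    ref_res.properties["dc:isReferencedBy"] = id_res.uri


ref = RefResource(18731, "http://search.artmuseums.go.jp",
                  {"dc:title": ["Landscape No.1 Girl", "風景 No.1 女の子"]})
work = IdResource(18731)
link(work, ref)
```

Keeping the scraped properties on the Ref-Resource and only links plus minimal metadata on the ID-Resource mirrors the division of responsibility described above.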
Core data object
Since we use several resources with Linked Open Data, we need to provide core data as a hub for integration. We decided to generate LODAC-Museum core data objects based on the Thesaurus of Japanese Art, which contains many different kinds of art information. It covers a wide range of content types as already structured concepts. We use it to discover relationships between different information resources. The core data object includes the creator, the work title, and the owner (the museum holding the work). It provides individual ID-Resources with the minimum necessary metadata. In addition, a link from each ID-Resource to the related Ref-Resource is generated to show that they describe the same resource. Some artwork ID-Resources have multiple links to Ref-Resources with "dc:references" (Figure 2).
Fig 2: Integrated data and each resource
Vocabulary mapping
LODAC-Museum does not describe detailed vocabulary for artwork and other contexts. We describe only common elements. Specifically, elements of the scraped data such as person name, size, title, and genre are re-mapped to a common vocabulary, because crossover searches and data integration require one. Table 2 summarizes the vocabularies used in LODAC-Museum.
PREFIX | URL |
---|---|
crm | http://purl.org/NET/cidoc-crm/core# |
dc | http://purl.org/dc/terms/ |
dc11 | http://purl.org/dc/elements/1.1/ |
foaf | http://xmlns.com/foaf/0.1/ |
skos | http://www.w3.org/2004/02/skos/core# |
rdfs | http://www.w3.org/2000/01/rdf-schema# |
ical | http://www.w3.org/2002/12/cal/ical# |
rda2 | http://RDVocab.info/ElementsGr2 |
lodac | http://lod.ac/ns/lodac# |
Table 2: Vocabulary list
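As a rough illustration of this re-mapping, the following Python sketch expands prefixed names using the prefixes of Table 2 and maps scraped fields to common properties. The scraped field names (`title`, `creator`, `size`, `genre`) and the function names are hypothetical; the target properties follow the Ref-Resource listing shown earlier, but the actual mapping rules used by LODAC-Museum are not given in this paper.

```python
# Prefixes copied from Table 2.
PREFIXES = {
    "crm": "http://purl.org/NET/cidoc-crm/core#",
    "dc": "http://purl.org/dc/terms/",
    "dc11": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ical": "http://www.w3.org/2002/12/cal/ical#",
    "rda2": "http://RDVocab.info/ElementsGr2",
    "lodac": "http://lod.ac/ns/lodac#",
}

# Hypothetical scraped field names mapped to common properties.
FIELD_TO_PROPERTY = {
    "title": "dc:title",
    "creator": "dc11:creator",
    "size": "dc:extent",
    "genre": "lodac:genre",
}


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'dc:title' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


def remap(record: dict) -> dict:
    """Keep only the fields that have a common property and rename them."""
    return {
        expand(FIELD_TO_PROPERTY[name]): value
        for name, value in record.items()
        if name in FIELD_TO_PROPERTY
    }
```

Fields without a common property are simply dropped in this sketch; a real pipeline would more likely keep them under a source-specific property.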
Procedure of identification from different data sources
Two Ref-Resources can turn out to be identical. In this case, we invalidate one of the ID-Resources associated with these Ref-Resources and make the two Ref-Resources share a single ID-Resource. The procedure is the following:
- Suppose that ID-Resource-A is associated with Ref-Resource-A, and ID-Resource-B with Ref-Resource-B. If ID-Resource-A turns out to be identical to ID-Resource-B, remove the link between Ref-Resource-B and ID-Resource-B.
- Re-link Ref-Resource-B to ID-Resource-A.
- Add primary metadata information on Ref-Resource-B to ID-Resource-A.
- If ID-Resource-B is accessed, it is redirected to ID-Resource-A.
- As a result, two Ref-Resources become a single entity. The integrated ID-Resource-A is now accessible from both Ref-Resource-A and Ref-Resource-B.
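The steps above can be sketched in Python as follows. This is a minimal in-memory illustration, not the system's actual code: the dictionary layout, the `redirects` table, and both function names are assumptions, and the sketch assumes that the metadata held by ID-Resource-B is the primary metadata derived from Ref-Resource-B.

```python
def merge_id_resources(store: dict, redirects: dict, keep: str, drop: str) -> None:
    """Merge the ID-Resource `drop` into `keep` (both are URIs in `store`)."""
    kept = store[keep]
    dropped = store.pop(drop)  # invalidate ID-Resource-B

    # Re-link the Ref-Resources of the dropped ID-Resource to the kept one.
    for ref_uri in dropped["references"]:
        if ref_uri not in kept["references"]:
            kept["references"].append(ref_uri)

    # Add the primary metadata, skipping values that are already present.
    for prop, values in dropped["metadata"].items():
        merged = kept["metadata"].setdefault(prop, [])
        for value in values:
            if value not in merged:
                merged.append(value)

    # Later accesses to the dropped URI are redirected to the kept one.
    redirects[drop] = keep


def resolve(store: dict, redirects: dict, uri: str) -> dict:
    """Follow redirects until a live ID-Resource is reached."""
    while uri in redirects:
        uri = redirects[uri]
    return store[uri]
```

Following redirects in a loop lets repeated merges (B into A, then A into C) still resolve old identifiers, which matters once other sites have linked to them.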
Problem of naming
When we create Linked Data on art and culture information, the names of people are among the most important pieces of information. An artist may be known by different names: for example, not only a real name but also a pen name, a screen name, and so on. To address this issue, we use foaf:nick to represent alternative names (Table 3).
Person Reference | Property |
---|---|
Creator (popular name) | foaf:name / skos:prefLabel |
Creator (pronunciation) | foaf:name @ja-hrkt / skos:altLabel |
Alternative name | foaf:nick |
Alternative name (pronunciation) | foaf:nick @ja-hrkt |
Creator (English) | foaf:name @en / skos:altLabel |
Table 3: Properties for person names
Naming raises another problem. The Kana pronunciation (the reading of a name) is an issue peculiar to Japanese, so we decided to use language tags to represent it (see the following example). The same issue arises in other metadata vocabulary, such as titles.
foaf:nick [
a lodac:Name;
lodac:label "武田"@ja;
lodac:label "たけだ"@ja-hrkt;
lodac:label "Takeda"@en;
] .
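To show how a consumer of such data might use the language tags, here is a small Python sketch that picks a label by tag. The `pick_label` function and the fallback rule are assumptions made for illustration; the three labels are the ones from the example above.

```python
def pick_label(labels, preferred_tags):
    """Return the first label whose language tag matches a preferred tag.

    `labels` is a list of (text, language_tag) pairs. Tags are compared
    exactly (ignoring case), so "ja" does not match "ja-hrkt".
    Falls back to the first label, or None if there are no labels.
    """
    for tag in preferred_tags:
        for text, lang in labels:
            if lang.lower() == tag.lower():
                return text
    return labels[0][0] if labels else None


labels = [("武田", "ja"), ("たけだ", "ja-hrkt"), ("Takeda", "en")]

kana_reading = pick_label(labels, ["ja-hrkt"])    # the Kana pronunciation
english_name = pick_label(labels, ["fr", "en"])   # first available preference
```

Exact tag comparison keeps the Kanji form ("ja") and its reading ("ja-hrkt") apart, which is the purpose of the separate tag.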
4. Current state of the prototype system
In this section, we describe the current state of the prototype system. We generated 529,229 resources and a total of 1,915,586 triples (including blank nodes), which are stored in "4store", an RDF database (see also Table 1).
We processed the obtained data to identify identical objects. First, we integrated museum facility resources that overlap between "the Thesaurus of Japanese Art (648 triples)" and "Cultural Heritage Online" by extracting matched titles. As a result, we got 77 integrated facility resources for museum information.
For example, there are two different Ref-Resources, "http://lod.ac/ref/3341/" and "http://lod.ac/ref/8057/", corresponding to "Kyoto Bunka Hakubutsukan". The Ref-Resource "http://lod.ac/ref/3341/" is taken from "the Thesaurus of Japanese Art" and is associated with artwork, address, and phone-number information. On the other hand, the Ref-Resource "http://lod.ac/ref/8057/" is from "Cultural Heritage Online" and is associated with business day, opening hours, and access information. We associate these Ref-Resources with an ID-Resource (http://lod.ac/id/3341) and furthermore associate the individual artworks stored in the museum with the ID-Resource. All the associated data on the museum is then available from the ID-Resource.
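Matching by title can be sketched in Python as below. The paper only states that matched titles were extracted; the normalization step (Unicode NFKC, trimming, case folding) and the function names are assumptions added here, and the second entry in the usage data is hypothetical.

```python
import unicodedata


def normalize(name: str) -> str:
    """Fold width variants, surrounding spaces and case (an assumed rule)."""
    return unicodedata.normalize("NFKC", name).strip().casefold()


def match_by_name(source_a: dict, source_b: dict) -> list:
    """Return (uri_a, uri_b) pairs whose names are equal after normalizing.

    Both arguments map a Ref-Resource URI to a facility name.
    """
    index = {}
    for uri_b, name in source_b.items():
        index.setdefault(normalize(name), []).append(uri_b)

    pairs = []
    for uri_a, name in source_a.items():
        for uri_b in index.get(normalize(name), []):
            pairs.append((uri_a, uri_b))
    return pairs


thesaurus = {"http://lod.ac/ref/3341/": "Kyoto Bunka Hakubutsukan"}
heritage_online = {
    "http://lod.ac/ref/8057/": "Kyoto Bunka Hakubutsukan",
    "http://example.org/ref/other": "Another Museum",  # hypothetical entry
}
pairs = match_by_name(thesaurus, heritage_online)
```

Each returned pair is a candidate for the identification procedure described earlier, where both Ref-Resources end up sharing one ID-Resource.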
Fig 3: Screen view of ID-Resource
Second, we show how artworks are associated via identified creator names. Figure 3 shows the ID-Resource and Ref-Resource for a famous Japanese painter named "SHIMOMURA Kanzan". The ID-Resource has one "dc:references" property pointing to the creator's resource and 14 "lodac:creates" properties pointing to artwork resources. The dc:references property links to the Ref-Resource "http://lod.ac/ref/359" (Figure 4).
Fig 4: Screen view of Ref-Resource
This Ref-Resource describes data for "SHIMOMURA Kanzan" taken from the Thesaurus of Japanese Art. It includes two artwork resources (http://lod.ac/ref/3762 and http://lod.ac/ref/4588). On the other hand, we discovered 12 artwork information items in the integrated data from 6 different sources. In total, 14 artworks are identified as his work and associated with his ID-Resource. Table 4 presents the facts of the integrated data.
Integration Context | Source | Number of Data | Integrated Data |
---|---|---|---|
Name of museum | Thesaurus of Japanese Art (All contents) | 648 | 77 |
 | Cultural Heritage Online | 915 | |
National Treasure (Title) or Important Cultural (Title) | Thesaurus of Japanese Art (Artwork) | 3800 | 74 |
 | DB for National Treasure & Important Cultural Property of National Designated (Artwork) | 10115 | |
Person name and artwork title (Multiple integration) | Thesaurus of Japanese Art (Person) | 1332 | 15020 |
 | Each museum (Artwork) | 61861 | |
Person name | Thesaurus of Japanese Art (Person) | 1332 | 615 |
 | Each museum (Artwork) | 61861 | |
Table 4: The facts of the integrated data
5. Conclusion
This prototype system indicates the great capability of Linked Data to integrate museum data which is naturally distributed. In particular, generating collections for individual artists from different sources is practically useful both for museum staff and audience.
There are still many issues to be solved, for example:
- Aborted integration: It is often caused by mistakes in the original sources. In this case, it is better to feed it back to the data provider and ask for proofreading.
- Multiple creators' names: We need to reference a good artist thesaurus.
- Fluctuation of description: This is a serious obstacle to integrating data. We need a good reference database for abbreviations.
- Updating resources: Since LODAC-Museum's resources are scraped from Web sites, regular updates or on-demand updates are needed.
- Links to other sites: It is useful to have links to other data sites such as the National Diet Library (Japan) and Europeana.
- Easy participation: We need to provide easier ways to participate in Linked Open Data; such as importing from CSV format data. It is important in particular for small museums.
We believe that further study of cultural information resources with Linked Open Data will be really beneficial to all the fields of arts and culture.
6. Acknowledgements
We thank the collaborators, including the LODAC project and KASM (Knowledge-as-Media Research Group), for valuable discussion. We also thank Toshiharu Omuka (University of Tsukuba) and Hakudo Fukuda (Atomi University) for allowing us to use the Thesaurus of Japanese Art.
7. References
Binding, Ceri, Keith May, and Douglas Tudhope (2008). "Semantic Interoperability in Archaeological Datasets: Data Mapping and Extraction via the CIDOC CRM". Research and Advanced Technology for Digital Libraries. 280-290.
Doerr, Martin, et al. (2006). The CIDOC CRM. Consulted January 31, 2010. Available http://www.cidoc-crm.org/
Fukuda, Hakudo, and Omuka Toshiharu (1997). "Some problems to make the fine arts thesaurus database". Information & documentation, Vol.40. No.9. 790-809. (In Japanese).
Nagamori, Mitsuharu, and Sugimoto Shigeo (2006). "Representing National Diet Library Subject Headings (NDLSH) in SKOS and its Graphical Browser". IPSJ SIG technical reports, 2006(118), 11-19. (In Japanese).
Takeda, Hideaki (2008). "Semantic Web and Linked Data", IEICE technical report, 108(316), 25-28. (In Japanese).
Takeda, Hideaki (2010). Special Interest Group for Semantic Web and Ontology. Last updated 27-07-2010. Consulted January 31, 2010. Available http://sigswo.org/A1001_program.html (In Japanese).
Tokyo National Museum Research Project Team (2005). Structured Model for Museum Object Information. 2005, last updated 16-12-2005. Consulted January 31, 2010. Available http://webarchives.tnm.jp/docs/informatics/smmoi/ (In Japanese)