
Museums and the Web

An annual conference exploring the social, cultural, design, technological, economic, and organizational issues of culture, science and heritage on-line.


Aaron Straup Cope, Near Future Laboratory


This paper will discuss the motivation behind the building=yes project, its technical implementation and shortcomings as well as future directions. It also addresses the growing number and role of community-driven bespoke registries of cultural artifacts. The fact that the Internet makes it possible for people to self-organize around a common interest effectively collapses the costs usually associated with production and distribution; if communities of amateurs don't feel like they have any avenues to participate in established projects or institutions, then they can and will just do it themselves.


building=yes is an open source project [1] started in 2010 to create stable and unique identifiers, and a permanent web page, for each of the 26 million (and counting) buildings listed in the OpenStreetMap (OSM) project.

Each building has associated with it an accurate geographic footprint and is situated within a hierarchical list of unique place IDs using the Yahoo! GeoPlanet [2] gazetteer. Buildings may be queried by tag searches, by full-text search, by geographic location and proximity and by place ID. In addition, other nearby buildings are listed.

Many of the records also have meaningful metadata including name, architect, physical dimensions (height) and other site specific details – although the majority do not. Those records without metadata are seen as a valuable opportunity to encourage increased participation and individual stewardship of one or more buildings rather than as a failing of the dataset. Projects like OSM and building=yes demonstrate that museums and libraries and archives are no longer the only institutions capable of collecting, housing and organizing cultural heritage.

In January 2010 the advisory board of the Built Works Registry (BWR) held its first meeting at the offices of ARTstor in New York City. [3] I was asked to give a short presentation, titled “Pass the Corbusier,” to the board outlining the state and practice of community-driven locative projects on the Internet. It is impossible to talk about community projects and location without talking about OpenStreetMap (OSM).

Founded in 2004, OSM is a community-based project to create a collaborative and freely licensed map of the world “one neighbourhood at a time”. OSM has a deliberately simple data model consisting of two primary types: nodes and ways. Nodes are “points” on the Earth. Each node has a unique ID and a latitude and longitude associated with it. Ways are collections of nodes that form an atomic “thing” in the world, say a building or a bridge or a highway. Every way is also assigned a unique ID.[4] Both nodes and ways have zero or more tags associated with them. A tag has a key (which can be thought of as a domain or namespace) and a value. There is no limit on the number of tags a node or way can have.
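The data model described above is small enough to sketch in a few lines. The following is only an illustration, not OSM's own code: the class names, the is_building and is_closed helpers, and the IDs and coordinates are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A point on the Earth: a unique ID, a position and free-form tags."""
    id: int
    lat: float
    lon: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered collection of node IDs describing one 'thing'."""
    id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)

    def is_building(self) -> bool:
        # Any value for the "building" key counts (yes, home, and so on).
        return "building" in self.tags

    def is_closed(self) -> bool:
        # A footprint is a closed ring: it ends on the node it started from.
        return len(self.node_ids) >= 4 and self.node_ids[0] == self.node_ids[-1]


# Four corner nodes grouped into a single way (IDs and positions are made up).
corners = [
    Node(1, 40.7640, -73.9670),
    Node(2, 40.7640, -73.9668),
    Node(3, 40.7638, -73.9668),
    Node(4, 40.7638, -73.9670),
]
footprint = Way(100, [1, 2, 3, 4, 1], {"building": "yes"})
```

Tags are plain string pairs with no schema, which is what lets a way carry as many or as few of them as contributors care to add.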

Since its creation, OSM has gone on to produce a map whose quality in places like the United Kingdom and Germany rivals commercial as well as publicly-funded maps. The maps of Haiti produced following the earthquake in 2010 are now considered to be authoritative and used by both the United Nations and World Bank.

To emphasize the ease with which anyone can participate in OSM, creating new records or editing existing ones, I added the building that houses the ARTstor office at 151 East 61st Street in New York City. I did this by adding four points (nodes) to a map and then grouping them into a single unit (a way) tagged building=yes.

James Shulman, president of ARTstor remarked at the time: "[W]hat seems like a fairly bland, renovated townhouse on the upper east side, now housing ARTstor and another non-profit was originally a townhouse belonging to Peggy Guggenheim, and the best legend about the house that I've heard was that she commissioned Jackson Pollock to create a mural for the 4th floor. But when he finished it, it was a foot too long to fit on the wall and so she ended up giving it away. I can't testify to whether this is true or not, but it's a good story about the place..." 

This experience had three effects:

First, it was a reminder that most of our histories happen behind the walls of otherwise and so-called “unremarkable” buildings.

Second, it got me wondering how many other buildings had already been added to the OSM database?

Finally, it affirmed a theory that in the absence of any other means to participate in the process of cataloging cultural heritage and memory, people can and will just do it themselves. As I wrote in Authority Records, Future Computers and Other Unfinished Histories (2011):[5]

They will self-organize. This is what the Internet has taught us. That it is the fastest cheapest bridge we’ve ever seen for collapsing the barriers of collecting, vetting and redistributing data.

Eventually, if a project gets off the ground (not all do) it will exist not just as an alternative to yours but in opposition to it. Once that happens any mistakes they make will be treated as badges of honour. And they will make mistakes, many of them the same mistakes you’ve made over the years and wouldn’t wish on your worst enemies. But they will also fix them. And in fixing them they will celebrate their resilience and their ability to nurture a collaborative project that can survive those mistakes.

building=yes, the website, then was an attempt to make concrete some of these ideas and to see what a registry containing 26 million user-contributed buildings looked like.

Technical Implementation

There are four separate but related technical components to building=yes:

  • Retrieval: downloading and extracting building data from OSM;
  • Processing: processing and importing the data;
  • Publishing: publishing the data as a searchable and browsable website;
  • Cartography: generating an effective cartography for displaying the buildings.


OSM provides a free and public download of its entire database distributed as a single compressed XML file, called the "planet.xml" file [6] (or sometimes just "the planet") that can be downloaded from the web.  As of January 2012 the compressed planet.xml file is 19 GB and 300 GB uncompressed.  There are a number of tools available for parsing the file and importing the data in to a PostGIS spatial database including one called Osmosis. The current version of Osmosis is written in Java and maintained by Ian Dees.

One of Osmosis's most useful features is the ability to prune the "planet.xml" file and extract only those nodes and ways matching one or more tag searches or geographic queries.[7] For example, to extract only those ways containing a building= tag in to a new file called “buildings.osm” you would issue the following command:

$> bzcat planet-latest.osm.bz2 | ./osmosis-0.39/bin/osmosis \
   --read-xml file=- --tf accept-ways 'building=*' --used-node \
   --write-xml file=buildings.osm  

Another option is to use the newer Osmfilter application, which is still under development and has a smaller memory (RAM) requirement than Osmosis. However, because Osmfilter is only able to read uncompressed “planet” files you will need to ensure that you have sufficient disk space to extract the raw data before you begin. To filter out only those ways with a building= tag in to a new file called “buildings.osm” you would issue the following commands:

$> bunzip2 planet-latest.osm.bz2
$> osmfilter planet-latest.osm --keep= --keep-ways=building= \
   --drop-relations -o=buildings.osm

As of January 2012, the Osmfilter method produces a 59GB file containing approximately 49.5 million building records! The extra 22 million records are accounted for by the fact that when launched the building=yes project only filtered on ways with a literal  building=yes tag.


The resultant XML file was then parsed again and all nodes and ways were stored as individual rows in a SQLite database. SQLite was chosen primarily because it uses a simple file-based datastore that requires no additional installation or monitoring, is supported by a wide range of programming languages and allowed the work of processing the data to be completed in incremental phases. It is not clear that this remains the correct approach for any future work given the final size of the database (16 GB) and the number of file-system writes required to populate it.[8]
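A minimal sketch of that import step, using only the Python standard library, might look like the following. The table layout, the JSON encoding of node references and tags, and the import_osm function name are assumptions made for illustration. They are not taken from the building=yes source code.

```python
import io
import json
import sqlite3
import xml.etree.ElementTree as ET


def import_osm(xml_source, db):
    """Stream an OSM XML file and store each node and way as one row."""
    db.execute("CREATE TABLE IF NOT EXISTS nodes "
               "(id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
    db.execute("CREATE TABLE IF NOT EXISTS ways "
               "(id INTEGER PRIMARY KEY, nodes TEXT, tags TEXT)")

    # iterparse avoids building the whole (very large) document up front.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "node":
            db.execute("INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)",
                       (int(elem.get("id")),
                        float(elem.get("lat")),
                        float(elem.get("lon"))))
            elem.clear()  # release child elements we no longer need
        elif elem.tag == "way":
            node_ids = [int(nd.get("ref")) for nd in elem.findall("nd")]
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            db.execute("INSERT OR REPLACE INTO ways VALUES (?, ?, ?)",
                       (int(elem.get("id")),
                        json.dumps(node_ids),
                        json.dumps(tags)))
            elem.clear()
    db.commit()
```

Calling import_osm("buildings.osm", sqlite3.connect("buildings.db")) would populate a file-based database. Because each row is written independently, the import can be stopped and resumed, which fits the incremental approach described above.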

Subsequently the centroid (or geographic center) for each building’s footprint was calculated by processing the collection of associated nodes using the Shapely Python library and stored in the "ways" table in the SQLite database.
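The project relied on Shapely for this step. For readers without Shapely installed, the same calculation can be approximated in plain Python with the standard "shoelace" formula for the centroid of a simple polygon. The sketch below is a stand-in for the library call, not the project's code, and it treats longitude and latitude as flat x and y coordinates, which is a reasonable simplification at the scale of a single building.

```python
def polygon_centroid(ring):
    """Return the (x, y) centroid of a simple polygon.

    `ring` is a list of (x, y) pairs, for example (lon, lat) for each
    node of a footprint. The ring is closed automatically if needed.
    """
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])

    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross

    if twice_area == 0:
        # Degenerate footprint (a line or a point): average the vertices.
        points = ring[:-1] or ring
        return (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))

    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))
```

With Shapely available, building a Polygon from the same coordinates and reading its centroid attribute should give an equivalent result.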

Once the geographic center for each building was calculated it was then "reverse-geocoded" using the Flickr API to convert its latitude and longitude into a series of unique place IDs and human readable names. The Sheraton San Diego Hotel and Marina, site of the 2012 Museums and the Web conference is located at latitude 32.715 and longitude -117.157. The Flickr reverse geocoder will tell you it is located in the neighbourhood of Core or WOE ID #29389024; the city of San Diego or WOE ID #2487889; the state of California or WOE ID #2347563; the United States or WOE ID #23424977.

This makes it possible to search for and retrieve buildings in a specific neighbourhood or locality without requiring a geo-spatial database or even storing the complex geographic data associated with those places, much of which is not available publicly or is licensed at a financially prohibitive cost when it is. For example, these four URLs link to the buildings located within the hierarchy of places listed above:

This was by far the most time-consuming part of the project, taking approximately three months to complete, even factoring in efficiencies like trimming coordinate data down to three decimal places and aggressive caching. One reason the process took so long was that two requests against the Flickr API were required for each latitude and longitude: one to determine the primary place, or WOE ID, for the building and a second to look up the hierarchy of parent locations for that WOE ID.
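The two-request pattern, together with the coordinate trimming and caching, can be sketched as follows. The find_place and get_hierarchy callables are hypothetical placeholders for whatever client code performs the two Flickr API requests. They are passed in as arguments so that the sketch makes no claims about the API itself.

```python
def make_reverse_geocoder(find_place, get_hierarchy, precision=3):
    """Build a caching reverse geocoder.

    find_place(lat, lon) -> WOE ID             (first request)
    get_hierarchy(woe_id) -> list of places    (second request)
    Both callables are supplied by the caller; they are assumptions here.
    """
    cache = {}

    def reverse_geocode(lat, lon):
        # Trim coordinates so that nearby buildings share one cache entry.
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            woe_id = find_place(*key)
            cache[key] = get_hierarchy(woe_id)
        return cache[key]

    return reverse_geocode
```

At three decimal places a cache entry covers roughly a city block, so densely built areas need far fewer requests than they have buildings.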

The reverse geocoding remains the most brittle piece of the equation since it relies on continued access to a third party service operated freely and with no contractual obligations. Although the GeoPlanet dataset (which is used by the Flickr reverse geocoder) is publicly available under a Creative Commons license it lacks spatial data. The same is true of many other openly licensed datasets including Geonames.

The code to do the reverse-geocoding described above, including both server-side and client-side implementations that cache data in MySQL and in-memory databases respectively, has been published as open source software. In addition the cached dataset from the initial reverse geocoding of the OSM building dump has been released under a Creative Commons Zero license. These do not export a full hierarchy of ancestors but a truncated version limited to neighbourhood (when available), locality, region and country.


Once the reverse-geocoding was completed, the data was indexed using a Solr database. Solr was chosen over a traditional spatially enabled relational database (RDBMS), like PostGIS, because it supports basic spatial functions like radial queries in addition to being able to perform sophisticated free-text based indexing and result faceting, neither of which are available in an RDBMS.

Rather than indexing the entire building footprint, only the centroid was indexed for spatial queries allowing users to search for "nearby" buildings.
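Solr performs radial queries internally. To make the idea concrete, the sketch below shows what a "nearby" search over centroids amounts to, using the haversine great-circle distance in plain Python. It is a stand-in for illustration only: the function names and the dictionary of centroids are invented, and a linear scan like this would not scale to millions of buildings without an index.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(centroids, lat, lon, radius_km):
    """Return building IDs whose centroid lies within radius_km, nearest first.

    `centroids` maps a building ID to its (lat, lon) centroid.
    """
    hits = []
    for building_id, (blat, blon) in centroids.items():
        distance = haversine_km(lat, lon, blat, blon)
        if distance <= radius_km:
            hits.append((distance, building_id))
    return [building_id for _distance, building_id in sorted(hits)]
```

Indexing only the centroid keeps each record to a single point, at the cost of occasionally missing a large building whose edge is nearby but whose center is not.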

OSM defines tags as key-value pairs (amenity=pub, highway=primary and so on), which are effectively Flickr-style "machine tags" with an implied osm: namespace. [9] With this in mind, all tags are stored as machine tags in Solr including the GeoPlanet hierarchy of places (WOE IDs).

Consider the following tags defined for the Ferry Terminal Building in downtown San Francisco. OSM exports the tag data for this building as:

<tag k="addr:state" v="CA"/>
<tag k="amenity" v="ferry_terminal"/>
<tag k="building" v="yes"/>
<tag k="ele" v="1"/>
<tag k="gnis:county_name" v="San Francisco"/>
<tag k="gnis:feature_id" v="223477"/>
<tag k="gnis:import_uuid" v="57871b70-0100-4405-bb30-88b2e001a944"/>
<tag k="gnis:reviewed" v="no"/>
<tag k="historic" v="landmark"/>
<tag k="name" v="Ferry Building"/>
<tag k="source" v="USGS Geonames"/>

Although OSM does not officially support or implement machine tags, people often use a similar syntax to scope tags to a particular topic or data source: for example, gnis: to indicate that the data is from the US Geographic Names Information System (GNIS) or addr: to indicate that the tag is part of an address.

building=yes extends this idea by adding an “osm” namespace prefix to those tags that don’t already have one and adding a woe: namespace and a placetype-related predicate for geographic locations. For example, the tags for the Ferry Building would be displayed as:

“osm” tags

“gnis” tags

“addr:” tags

“woe:” tags

The use of machine tags as a storage mechanism allows complex faceted queries while using only a single multi-value storage field in the database.
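The namespacing rule described above can be sketched as a small conversion function. The exact string format stored in Solr is not given here, so the namespace:predicate=value layout and the placetype names used as woe: predicates are assumptions made for illustration.

```python
def to_machine_tags(osm_tags, woe_places=None):
    """Convert OSM key/value tags to machine-tag strings.

    Keys that already carry a prefix (gnis:, addr:) keep it. Everything
    else is placed in an "osm" namespace. `woe_places` maps a placetype
    name to a WOE ID and is emitted under a "woe" namespace.
    """
    machine_tags = []
    for key, value in sorted(osm_tags.items()):
        if ":" in key:
            namespace, predicate = key.split(":", 1)
        else:
            namespace, predicate = "osm", key
        machine_tags.append(f"{namespace}:{predicate}={value}")

    for placetype, woe_id in sorted((woe_places or {}).items()):
        machine_tags.append(f"woe:{placetype}={woe_id}")
    return machine_tags


ferry_building = {
    "addr:state": "CA",
    "amenity": "ferry_terminal",
    "building": "yes",
    "gnis:feature_id": "223477",
    "name": "Ferry Building",
}
tags = to_machine_tags(ferry_building)
```

Because every tag ends up with the same three-part shape, a single multi-value field can be queried by namespace, by predicate or by value without any per-tag schema.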

In addition to the unique OSM way ID associated with each building the site generates a unique 64-bit "building" ID. 64-bit identifiers are used so that buildings can fit in to the WOE hierarchy of places without accidentally stomping over any existing IDs, which are defined as 32-bit integers. Therefore the very first building=yes ID is 32-bits + 1 (or 2147483648). All buildings have permanent URLs for both their OSM way ID and their building ID. [10]
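The arithmetic behind that starting value is simple enough to check directly. The sketch below assumes WOE IDs are signed 32-bit integers, so the largest possible one is 2147483647. The function names are invented, and a real service would persist the counter rather than keep it in memory.

```python
import itertools

MAX_32BIT_ID = 2 ** 31 - 1             # 2147483647, largest signed 32-bit value
FIRST_BUILDING_ID = MAX_32BIT_ID + 1   # 2147483648, the first building=yes ID


def mint_building_ids(start=FIRST_BUILDING_ID):
    """Yield building IDs that can never collide with a 32-bit WOE ID."""
    return itertools.count(start)


def is_building_id(uid):
    """True when an ID falls in the range reserved for buildings."""
    return uid >= FIRST_BUILDING_ID
```

Any ID below the threshold can then be treated as a place and any ID at or above it as a building, without consulting a lookup table.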

Each of those UIDs can also be used as machine tags on third-party services like Flickr to denote that a photo represents one or more buildings. For example the photos in this screenshot for the record of the CCTV building in Beijing were tagged on Flickr with osm:way=33459516 :

The application layer of the site is built using a standard PHP + Apache setup and most of the code piggybacks on top of a software package called Flamework. Flamework is an open-source project maintained by a number of ex-Flickr engineers and aims to re-implement, from scratch, most of the core libraries and application models used to build Flickr itself.

Although the requirement of Solr as a datastore means that the site itself cannot be run entirely on a shared web-hosting service, Flamework itself is designed to be a workable solution for these kinds of consumer-facing services as well as offerings like Amazon's EC2 virtual servers.


The website uses stylized aerial imagery as a canvas on which to display buildings. Traditional satellite map tiles are pre-processed through a "dithering" filter[11] to give them the appearance of a black and white newspaper-style halftone image. The use of custom cartography was part of an effort to not let the map get in the way of the data itself but instead to provide a minimal geographic context.
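For readers unfamiliar with the filter, the following is a minimal pure-Python sketch of Atkinson dithering applied to a greyscale image held as a list of rows. It is not the Atkinstache code. It only illustrates the commonly described form of the algorithm, in which each pixel is snapped to black or white and one eighth of the error is pushed to each of six neighbouring pixels.

```python
def atkinson_dither(pixels, threshold=128):
    """Dither a greyscale image to pure black (0) and white (255).

    `pixels` is a list of rows, each a list of values from 0 to 255.
    A new image of the same shape is returned; the input is not modified.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    work = [[float(value) for value in row] for row in pixels]

    # Offsets (dx, dy) of the six neighbours that receive a share of the error.
    neighbours = ((1, 0), (2, 0), (-1, 1), (0, 1), (1, 1), (0, 2))

    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255.0 if old >= threshold else 0.0
            work[y][x] = new
            # Only 6/8 of the error is diffused, which keeps contrast high.
            share = (old - new) / 8.0
            for dx, dy in neighbours:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    work[ny][nx] += share

    return [[int(value) for value in row] for row in work]
```

Discarding a quarter of the error is what gives the result its characteristic crisp highlights and shadows compared with filters that diffuse the full amount.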

The footprints of the individual buildings are drawn on top of the map tiles dynamically using JavaScript and the Canvas drawing libraries available in all modern web browsers. This allows buildings to be added and removed from the database without needing to regenerate base tiles and enables a richer level of interactivity.

One shortcoming of the existing cartography is that it confuses the totality of all buildings, seen through the god's eye view of the satellite, with those buildings that have been added to OSM. This problem is exacerbated on "search" style pages where the total result set is paginated and it is unclear whether a building footprint is excluded because it is part of a different "page" or because it has not been added to OSM. A good example of this problem is illustrated by the page for the San Francisco International Airport where the building outline for the Terminal 2 building is not shown until the second page of results:

In 2010, the same year that building=yes was released, Stamen Design released its map=yes project in collaboration with the online mapping service MapQuest. It points to an alternative and ultimately more flexible approach to creating custom cartographies for the building=yes site.

As part of their "Open Maps" initiative MapQuest has embraced the OSM project and committed its support to improve the data and to provide services around it. This includes running a publicly available, and up-to-date, instance of the OSM "extended API" (XAPI) service. The easiest way to think about XAPI is to imagine it as a network-enabled version of the Osmosis tool. Rather than operating on the raw XML as Osmosis does, XAPI instead provides a simple HTTP interface for querying a PostGIS database by bounding box and one or more tag filters.

Although OSM is a freely available dataset, it remains a non-trivial endeavour to build and maintain and requires a significant investment in time and hardware. By assuming that burden, MapQuest makes it possible for users with fewer resources to interact with and experiment with OSM data. Rather than having to tackle the entire world, literally, they can use the XAPI endpoint to request data for a smaller geographic area specific to a project. And when you think about it, map tiles are just a series of “smaller geographic areas” that are neighbours. This opens up a whole new range of possibilities:

Most online maps are designed to help you get around in a car. This generally means displaying: roads, businesses, buildings, on-ramps, parks, oceans and traffic congestion. Nothing wrong with that! Designers get handed a tool kit that has as many tools as a good swiss army knife, and the maps reflect these tools. Millions of people use them to make appointments across town, find restaurants, and drive home for the holidays.

But what if, instead of a swiss army knife, we used a box of crayons? Or charcoal and newsprint? Or play-doh? What would those maps look like? What could they tell us about the world? (Eric Rodenbeck, map=yes)

Just as you might prune the larger planet OSM file for only ways tagged building=* using Osmosis, the XAPI endpoint makes it possible to perform the same sorts of filtering, albeit for lots of tiny map tile sized bounding boxes, and produce custom cartography at the same time.
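To illustrate how a map tile becomes a bounding box query, the sketch below converts a tile address (zoom, column, row) in the common "slippy map" scheme to geographic bounds and formats an XAPI-style predicate for it. The way[...][bbox=west,south,east,north] form follows the XAPI conventions as generally documented. The host and path prefix of any particular endpoint are deliberately left out, and the function names are invented for the example.

```python
import math


def tile_bounds(zoom, col, row):
    """Return (west, south, east, north) in degrees for a slippy map tile."""
    n = 2 ** zoom

    def lon(x):
        return x / n * 360.0 - 180.0

    def lat(y):
        # Inverse of the spherical Mercator projection used by web maps.
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * y / n))))

    return (lon(col), lat(row + 1), lon(col + 1), lat(row))


def xapi_query(zoom, col, row, tag_filter="building=*"):
    """Format an XAPI-style query for all matching ways inside one tile."""
    west, south, east, north = tile_bounds(zoom, col, row)
    return (f"way[{tag_filter}]"
            f"[bbox={west:.6f},{south:.6f},{east:.6f},{north:.6f}]")
```

A tile server can issue one such query per requested tile and cache the rendered result, so the cost of talking to the remote service is paid only the first time a tile is viewed.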

To demonstrate this idea Stamen produced a set of open source tools allowing users to generate maps restricted to one or more tag filters, using the MapQuest XAPI endpoint. For example, all of the buildings and leisure areas in Paris or all of the buildings and on/off ramps near the airport in San Francisco:

The map tiles are generated dynamically using TileStache, an open source map tile library written in Python, and the gunicorn server framework, also written in Python. The tile server is not exposed directly to the Internet and requests are proxied through Apache, the same web server running the application logic itself.

TileStache uses a simple configuration file to define map layers (or "providers"). It should, therefore, be possible to create a new composite layer of all the buildings in OSM overlaid on "dithered" satellite imagery like this:

      "layers": {
            "naip": {
                  "provider": {
                        "name": "proxy",
                        "provider": "MAPQUEST_AERIAL"
                  }
            },
            "dithered": {
                  "provider": {
                        "class": "Atkinstache.dithering.Provider",
                        "kwargs": {
                              "source_layer": "naip"
                        }
                  }
            },
            "buildings": {
                  "provider": {
                        "class": "mapequalsyes.footprint.Provider",
                        "kwargs": {
                              "type": "way",
                              "query": "[building=*]",
                              "datasource": "xapi"
                        }
                  }
            },
            "buildingequalsyes": {
                  "provider": {
                        "class": "TileStache.Goodies.Providers.Composite.Provider",
                        "kwargs": {
                              "stack": [
                                    {"src": "dithered"},
                                    {"src": "buildings"}
                              ]
                        }
                  }
            }
      }

The first provider proxies requests for MapQuest’s aerial imagery, using the built-in “Proxy” provider in TileStache. The second uses the first provider (“naip”) as its input and generates a dithered version (of the map tile) as its output. The third provider queries the MapQuest XAPI endpoint for OSM buildings contained by the bounding box for the requested map tile and draws them on a transparent background. The fourth “Composite” provider combines the output of the second and third providers (“dithered” and “buildings”) in to a new tile. All of the providers, unless configured otherwise, cache their output to disk.

Map tiles generated using the XAPI interface might occasionally be out of sync with a local copy of OSM buildings but this seems like both an acceptable discrepancy and one that will likely be addressed by future plans/work for the project.

Next Steps

In addition to providing complete data dumps of the building=yes site, the most pressing next steps for the building=yes project are adding the ability for individual records to be edited on the site and having those changes relayed to OSM and, conversely, tracking changes from OSM and updating the building=yes database accordingly.

This work will be broken up in to three pieces:

First is to use the OSM OAuth API as a single-sign-on / login provider in order to obtain a delegated authentication token allowing building=yes to make changes to nodes and ways on the user's behalf. This code has already been implemented and is available both as part of the building=yes codebase and the flamework-osmapp package.

Second: to design and implement a user-interface for editing the tags and geometries (the nodes) for each building (ways). Those changes will then need to be written back to the OSM database using the OSM API along with a workflow and interface decisions to account for error conditions, both from OSM and in the changes generated by individual users. An important consideration in this work will be to consider how, and whether, these tools can be used by or influenced by the ongoing work of the Humanitarian OpenStreetMap Team to map buildings in countries like Indonesia.

Finally, handling updates from OSM on a near real-time (hourly, for instance) basis. Retrieving and extracting changes is relatively straightforward using the Osmosis application but the code for parsing the resulting change file and applying the differences to the building=yes Solr database still needs to be completed. As of this writing, it is assumed that any changes sent by the OSM server will take precedence over the local database.
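Since that code is described as unfinished, the following is only a guess at its general shape. It assumes the change file uses the osmChange layout, with create, modify and delete blocks wrapping the affected elements, and it uses a plain dictionary keyed by way ID as a stand-in for the Solr index. In line with the stated assumption, whatever OSM sends simply overwrites the local record.

```python
import xml.etree.ElementTree as ET


def apply_changes(osc_xml, buildings):
    """Apply an osmChange document to a local {way_id: tags} store.

    OSM always wins: created and modified ways overwrite local records,
    and ways that are deleted or no longer tagged as buildings are dropped.
    """
    root = ET.fromstring(osc_xml)
    for action in root:  # <create>, <modify> or <delete>
        for way in action.findall("way"):
            way_id = int(way.get("id"))
            tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
            if action.tag == "delete" or "building" not in tags:
                buildings.pop(way_id, None)
            else:
                buildings[way_id] = tags
    return buildings
```

A real implementation would also need to recompute centroids and place hierarchies for any way whose nodes moved, which this sketch ignores.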

Longer-term considerations include:

  • Migrating away from custom software tools and instead using the tools built by and for the OSM community itself. What would it mean to have a series of parallel OSM databases for buildings or other bespoke datasets[12] that might otherwise fall outside the purview of OSM but that could benefit from the work that’s done to house and distribute that data and reduce the need for custom software?
  • Services for minting ranges of unique identifiers (UIDs) in an effort to prevent ID collision, even across multiple domains or namespaces.
  • Historical support. OSM remains a project resolutely focused on the here and now and there is little if any consideration, technical or otherwise, for historical views on the data.[13] Frankie Roberto’s presentation on “Mapping History” at the 2009 State of the Map conference and the discussion around creating a “History API” for OSM are encouraging efforts to begin to address this problem.


Waiting Spaces

Projects like OSM and building=yes demonstrate that museums and libraries and archives are no longer the only institutions capable of collecting, housing and organizing cultural heritage. What the Internet has demonstrated is that it is possible for communities of interest to self-organize around a topic and in a relatively short span of time produce bodies of work that sometimes rival traditional scholars in their depth and almost always exceed them in their breadth.

I'm not talking about the mechanics of storage or preservation and conservation. For the sake of brevity I'll also say that I'm not talking about curating or museum programming. What I am talking about, though, is the other thing that museums and archives do: cataloging.

This is the place where museums and the larger public and communities of enthusiasts are meeting and being forced to find common ground. The opportunity facing museums, and by extension museum studies, today is how to use and shape participation in that cataloging process: to imagine museums not simply as archives of a considered past but also as the trusted waiting space of future considerations.


Atkinstache, Cope, A. S. (2012)

Authority Records, Future Computers and Other Unfinished Histories, Cope, A. S. (2011)

building=yes source code, Cope, A. S. (2011)

building=yes reverse geocoding data dump

CC0 1.0 Public Domain Dedication

Flamework (2011)

flamework-osmapp, Cope, A. S. (2011)

Flickr API (2005)

Gunicorn (2010)

Humanitarian OpenStreetMap Team (HOT)

Imagining the Built Works Registry, Cope, A. S. and Kuan, C. (2011)

Machine Tags, Theory Working Code and Gotchas, Cope, A. S. (2010)

Mapping History, Roberto, F. (2009)

MapQuest Open Maps

OpenStreetMap Without Delay, Amos, M. (2010)

OpenStreetMap, History API

OpenStreetMap, Key:Addr

OpenStreetMap, Latest Downloads

OpenStreetMap, Map Features

OpenStreetMap, OAuth

OpenStreetMap, Osmfilter

OpenStreetMap, Osmosis

OpenStreetMap, Project Haiti

OpenStreetMap, Processing the (planet.xml) File

OpenStreetMap, Protocol Buffers

OpenStreetMap recognized by UN Foundation (2011)

Pass The Corbusier, Cope, A. S. (2011)

Poverty Mapping with An OpenStreetMap Base in Sumbawa, Chapman, K. (2011)

Reverse Geoplanet, Cope, A. S. (2012)

Shapely, Gillies, S.

Stamen Design

Yahoo! GeoPlanet



[1] The source code for both the website and the tools used to pre-process the data is available at:

[2] GeoPlanet is sometimes still referred to as Where on Earth (WOE) after the company that originally developed the technology and was purchased by Yahoo! in 2005. The unique IDs for individual records in the GeoPlanet dataset continue to be called “WOE IDs”.

[3] The BWR is a joint endeavor of the Avery Architectural & Fine Arts Library at Columbia University, ARTstor and Getty Research Institute to build and maintain a community-generated data resource for architectural works and the built environment, funded by the Institute of Museum and Library Services (IMLS).

[4] There are also “relations” which are collections of ways (a single relation for a motorway might be comprised of multiple “ways” or road segments) but they are not relevant and out of scope for this discussion.

[5] This argument was later highlighted during a presentation about authority records, amateurs and communities of interest at Museums and the Web (MW) in 2011, which seemed especially relevant to a project like the BWR. Many of these ideas also found their way in to a paper co-authored with Christine Kuan, Chief Content Officer and Vice President of External Affairs at ARTstor, about the BWR titled "Imagining the Built Works Registry".

[6] There is on-going work within the OSM community to bundle and distribute data files using Google’s Protocol Buffer format but it remains experimental.

[7] After the building=yes website was launched many people asked why the site limited itself to only those ways whose “building” value was “yes”. It turns out that many users will tag their residences as building=home and so on. Future versions of the site will not distinguish between these values and instead filter the planet.xml file with a more liberal building=* (all buildings) query.

[8] ST_Centroid and other built-in, and optimized, spatial functions are at least one reason for using a dedicated geo-spatial database like PostGIS instead of SQLite. There is also the SpatiaLite database, which has not been tested as of this writing. In an age of increasingly low-cost solid state storage devices (SSDs), though, this argument may not remain relevant for much longer. This point was argued by Artur Bergman in a talk on SSDs during the 2011 Velocity Conference.

[9] “Machine tags” are tags that conform to a specific (and very simple) syntax that defines facets by which a tag may be indexed and queried: a namespace; a predicate; and subject. They were introduced by Flickr in 2007.

[10] The chances of GeoPlanet ever exceeding, or even reaching, the limit of 2147483648 IDs are practically nil, which makes the practice of creating building IDs greater than that number somewhat academic. You could just as easily start at the upper limit of 32 bits and count backwards.

[11] The halftone effect is accomplished using Bill Atkinson's original dithering algorithm created for the first Apple Macintosh computers and was implemented in Python in 2007 by Michal Migurski.

[12] One example is the suggestion of adding airplane or flight paths to OSM, which is routinely dismissed on the grounds that it fails to meet the basic requirement that all data added to the database be things which can be traveled to and seen.

[13] There are however weekly snapshots of the “planet.xml” files dating back to 2007.
