XML, Metadata, and setting up for harvest
by Jens on Jul.29, 2011, under Data Management, Online Data Sources
In rural Scotland, it’s the time of year when harvesters start to become a common sight in the fields. But I am setting up for a different kind of harvest: the harvesting of my organisation’s metadata.
Creating metadata is a great way to get an overview internally, but there are increasing demands on public authorities to share the metadata externally as well. Apart from the legal requirements for serving spatial data and metadata, there is a plethora of ways in which this can be done, and a lot of it comes down to how deep your pockets are, and how many staff hours/days/months/years you are willing to spend on it.
One of the characteristics of most metadata is that it adheres to a known standard. This standard defines the information fields used to describe data holdings, and also specifies whether a field is mandatory, optional or conditional, that is, required depending on the values of other fields.
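To make the mandatory/optional/conditional distinction concrete, here is a minimal sketch of how such rules could be checked against a record. The field names and the `restricted` flag are made up for illustration and are not taken from any real standard.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldRule:
    name: str
    obligation: str  # "mandatory", "optional" or "conditional"
    # Only used for conditional fields: returns True when the field is required.
    required_when: Optional[Callable[[dict], bool]] = None


def missing_fields(record, rules):
    """Return the names of fields that are required for this record but empty."""
    missing = []
    for rule in rules:
        if rule.obligation == "mandatory":
            required = True
        elif rule.obligation == "conditional":
            required = rule.required_when(record)
        else:
            required = False
        if required and not record.get(rule.name):
            missing.append(rule.name)
    return missing


# Hypothetical rules, not copied from any published standard.
RULES = [
    FieldRule("title", "mandatory"),
    FieldRule("abstract", "mandatory"),
    FieldRule("lineage", "optional"),
    FieldRule("access_constraints", "conditional",
              lambda r: r.get("restricted") is True),
]

print(missing_fields({"title": "Seabed survey", "restricted": True}, RULES))
```

A record with only a title and the `restricted` flag set is reported as missing its abstract and its access constraints, while the optional lineage field is never flagged.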
Common to a lot of these standards are some basic fields, such as title, description/abstract, originator, and publication and modification dates. But some standards go much further, and provide much more detail on the source, formats and extent of the data sources.
There is generally considerable overlap between a lot of metadata standards, and in my experience this is one of the sources of frustration for people beginning the process of working up metadata. With a bewildering array of standards, sometimes with only subtle differences, it can be difficult to get an overview of which one to start with. However, my strong recommendation is to spend some time on this aspect. Yes, it involves reading long, tedious documentation and comparing across standards. But by getting this knowledge early on, you can avoid setting yourself up for disappointment. Some standards will be supersets of others, and as such require more information. Therefore, it is important to be clear on what your obligations and aims are in terms of capturing and serving information. There’s no need to capture more than needed. The only thing to come out of that is that you or your staff spend more time on creating the records or, in the worst case, abandon creating metadata because it takes too long.
The next thing is to ensure that you have captured all the required fields for the standards that you are going to deliver in. For example, in my organisation, we have a database which was natively in the eGMS metadata format. However, we actually need to serve data externally in the UK GEMINI and MEDIN metadata formats. Rather than completely changing the existing 750 records, we were able to add some fields and slightly modify existing ones. This allowed us to capture all the necessary information to comply with the new formats, even though strictly speaking we still worked with an eGMS database.
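A rough way to find out which fields need adding is to compare what the internal database holds with what each target standard requires. The sketch below does this with plain sets. The field lists and standard names are placeholders, not the real eGMS, UK GEMINI or MEDIN element lists, which you would need to take from the respective documentation.

```python
def schema_gaps(internal_fields, target_requirements):
    """For each target standard, list required fields the internal schema lacks."""
    have = set(internal_fields)
    return {
        standard: sorted(set(required) - have)
        for standard, required in target_requirements.items()
    }


# Placeholder field names for illustration only.
INTERNAL_FIELDS = ["title", "abstract", "originator", "date_published"]
TARGETS = {
    "external_standard_a": ["title", "abstract", "bounding_box"],
    "external_standard_b": ["title", "abstract", "originator",
                            "bounding_box", "lineage"],
}

for standard, gaps in schema_gaps(INTERNAL_FIELDS, TARGETS).items():
    print(standard, "needs:", gaps)
```

The fields reported for each standard are the ones to add to the internal database, without touching the records that already exist.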
In this post I’m not going into the creation of metadata and the user engagement around it, but rather the “behind the scenes” element of getting things out. With a database, we have the ability to write out the data in XML format.
And this is where it’s currently at. We can export XML files from a SQL database. These are then validated in two ways: against the standards that the XML format (in this case the MEDIN format) lists in its header, and against generic XML formatting rules.
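As an illustration of the export and the generic XML check, here is a minimal sketch using only the Python standard library. An in-memory SQLite table stands in for our actual database, and the table, column and element names are invented; this is not the MEDIN schema. Validating against the schemas named in the file header needs a schema-aware library and is not shown here.

```python
import sqlite3
import xml.etree.ElementTree as ET


def record_to_xml(row):
    """Serialise one database row as a flat XML record (hypothetical layout)."""
    root = ET.Element("record")
    for column in row.keys():
        child = ET.SubElement(root, column)
        child.text = "" if row[column] is None else str(row[column])
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text):
    """Generic XML check: does the text parse at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Stand-in database with a single made-up record.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE metadata (title TEXT, abstract TEXT)")
conn.execute("INSERT INTO metadata VALUES (?, ?)",
             ("Seabed survey", "Depth < 50 m & rocky"))

exported = [record_to_xml(row) for row in conn.execute("SELECT * FROM metadata")]
for xml_text in exported:
    print(is_well_formed(xml_text), xml_text)
```

Building the XML through a library rather than by string concatenation means characters such as `<` and `&` in the abstract are escaped automatically, which is a common reason for hand-built exports failing the generic check.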
Once that hurdle is crossed, we move on to actually publishing the metadata, and as mentioned, there are many ways to do this, depending on the depth of your pockets and so on.
Commonly, to be compatible with harvesters from other portals, we must be able to deliver the metadata either in an OAI-PMH format or by acting as a catalogue server. To provide these capabilities, we’re deploying GeoNetwork, which will accept the individual XML records created locally and serve them up on the web, both through a browsable and searchable interface and by running a CSW (Catalogue Service for the Web) at the same time.
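To give a feel for what a visiting harvester actually asks for, the sketch below builds the two kinds of request URL. The base URLs are placeholders (the real endpoints depend on how the catalogue is deployed), and the functions only construct the URLs without contacting any server. The query parameters themselves come from the OAI-PMH and CSW specifications.

```python
from urllib.parse import urlencode


def oai_pmh_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def csw_get_capabilities_url(base_url):
    """Build a CSW GetCapabilities request, usually a harvester's first call."""
    params = {"service": "CSW", "request": "GetCapabilities"}
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoints, not a real deployment.
print(oai_pmh_list_records_url("https://example.org/oai"))
print(csw_get_capabilities_url("https://example.org/csw"))
```

Pasting URLs like these into a browser against your own server is a quick way to see what a harvester will receive before anyone else comes to collect it.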
So in future posts, I will share more about my experiences of actually releasing and harvesting metadata, but to summarise the recommendations so far:
1. Get to know your standards. The documentation is dry, but necessary. You can save yourself a lot of work by investigating these early on.
2. Don’t be afraid to mix’n’match internally. If a combination of one or more standards suits your organisation better, then it’s not really a problem. As long as you can populate the standards that you want to share externally, you can have more or fewer fields (or even translate your internal formats to the standard) on the inside.
3. Consider the scale of distribution. How central is this to your business and who is going to use it? If metadata is at the core of your MDM, you probably want full integration so the metadata is always close to the data, and perhaps even linked directly to it in your applications (e.g. GIS). But for others it may be a low number of records, which can be hand-edited and published.
4. More than one standard, or more than one harvester coming to visit? If you know who your main customers are going to be in terms of ingesting your accessible metadata, you can set up targeted collections for them. It also means that it’s quite easy to serve the same metadata to multiple groups, simply by “tagging” the records, at least in GeoNetwork.
5. Ask the developers. When you read metadata and methodology documentation, it can be easy to think of it all as terribly complex and “I probably don’t understand it”. But if something does not make sense, ask the writer or the developers! In a couple of cases my own questions have turned out to point to errors in the documentation, which has been updated and corrected as a consequence, making it easier for the next person reading it.
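The tagging idea in recommendation 4 can be pictured as grouping record identifiers by tag, so that each harvester is pointed at its own collection. This is only a plain-Python sketch of the concept with invented record ids and tags; it does not use or reflect GeoNetwork’s own interface.

```python
from collections import defaultdict


def group_by_tag(records):
    """Map each tag to the ids of the records carrying it."""
    collections = defaultdict(list)
    for record in records:
        for tag in record.get("tags", []):
            collections[tag].append(record["id"])
    return dict(collections)


# Invented records: one is shared by two collections, one is untagged.
RECORDS = [
    {"id": "rec-1", "tags": ["marine", "national"]},
    {"id": "rec-2", "tags": ["national"]},
    {"id": "rec-3", "tags": []},
]

for tag, ids in group_by_tag(RECORDS).items():
    print(tag, ids)
```

The same record can sit in several collections without being duplicated, and an untagged record is simply not offered to anyone.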