Online Data Sources
XML, Metadata, and setting up for harvest
by Jens on Jul.29, 2011, under Data Management, Online Data Sources
In rural Scotland, it’s the time of year when harvesters start to become a common sight in the fields. But I am setting up for a different kind of harvest – the harvesting of my organisation’s metadata.
Data Qualia
by Jens on Mar.02, 2011, under Data Management, Online Data Sources
First off – no, I’m not so rushed that the heading is a typo.
It is instead the title of a short but interesting blog post by Jim Harris over at the OCDQ blog.
In this post, Jim covers how the word “Qualia” is used to describe the subjective quality of conscious experience, and how very, very relevant it is in terms of data quality management.
Anything above and beyond the most basic data quality checking is really in the eye of the beholder. One of the common memes of DQ is that data should be “fit for purpose”. Although fitness of data is important, it is the part of the purpose that relates more to qualia than to quality that raises issues, especially when it comes to sharing data. Trying to predict the purpose of someone accessing data and ensuring that the data are fit for this purpose is virtually impossible. Yet it is an issue that becomes important when organisations and public authorities may otherwise have to include lengthy licenses to remove doubt about liability.
So perhaps the term Data Qualia is one we will see more often in the future.
Discussing Data: LinkedIn Groups
by Jens on Feb.02, 2011, under Data Management, General, Online Data Sources
Data managers benefit from discussing with peers, just like in any other trade – and perhaps even more so because of the rapid pace of technological development. However, as cash is tight during this financial downturn, many may find travel options limited and will turn to the web instead of (or in addition to) face-to-face meetings.
There are of course many, many places on the web where you can start discussing, but I often find that locating an active community can be a bit of a challenge. LinkedIn is a thriving community with many active groups, mainly focussed on the professional aspects of our lives. The benefit of some of these groups is that they include real-life professionals with clear credentials, who actively debate issues that are close to the hearts of data managers, ranging from how best to implement quality management through to privacy and ownership issues.
In this post, I have gathered a short list of some of the groups I enjoy reading and participating in. Note that for most of them you will have to request to join – but it helps reduce the amount of spam and keeps the discussion focussed amongst people with relevant interests. I hope you might find it useful, and if you know of other sites, communities or groups, please do share.
The Data Ownership in the Cloud group on LinkedIn is a global venue for multi-disciplinary networking between technologists and non-technologists interested in providing thought leadership on this critical issue.
IAIDQ Information/Data Quality Professional Open Community
LinkedIn.iaidq.org is an open community for Information Quality, Data Quality and Data Governance professionals (practitioners, consultants, academics, vendors etc.) to support collaboration, learning, networking and interaction in a vendor-neutral format.
Obsessive-Compulsive Data Quality (OCDQ)
This is a networking group for Obsessive-Compulsive Data Quality (OCDQ), which is an independent blog offering a vendor-neutral perspective on data quality.
Vendor-neutral does not mean no vendor related content. When the products and services of vendors are presented or discussed, it will be done in an objective manner.
The goal is to foster an environment in which a diversity of viewpoints is freely shared without bias. Everyone is invited to get involved in the discussion and have an opportunity to hear what others have to offer.
Talend Open Source Data Integration
Talend is the first provider of open source data integration solutions, used primarily for ETL for business intelligence/data warehousing, data synchronization, data migration, operational data integration, data quality and MDM.
This group enables Talend users (and friends) to share information, news, discussions, and ideas about their Talend projects – and anything else of interest for the Talend community!
The purpose of this group is to bring together a professional group of individuals to collaborate about issues, problems, situations within the data space.
We all know that the world of enterprise data is sometimes hard to control and understand but together we can make a difference and learn from one another’s experiences. Whether you’re dealing with Oracle, SAP, IBM, etc. it doesn’t matter we still share some of the same pain points.
This group encourages everyone to share their lessons learned, thought leadership ideas, best practices, etc. that will educate and promote the development of world class solutions.
Habitats – (Social Validation of INSPIRE Annex III Data…
Habitats (Social Validation of INSPIRE Annex III Data Structures in EU Habitats) related spatial data is critical in the management of Europe’s bio-diversity. INSPIRE needs work here, particularly in its Annex III data themes: 16-18 (Sea regions, Bio-geographical regions, Habitats & biotopes, Species distribution).
Data Management as a search term
by Jens on Jan.13, 2011, under Data Management, Online Data Sources
When we go out on the World Wide Web searching for solutions, discussions, news, and just about everything else relating to a topic we might be interested in, we leave a little footprint behind. These footprints of what has been searched for can at times be interesting to study. Newspapers use tools that track them to report on the most sought-after words on the web in the lead-up to the New Year, but the same tools can also give you a little insight into topic areas and how they are developing.
Google Trends provides an option for combining multiple search terms and looking at their trends on either a global or national basis (or even by sub-region for very popular search terms). This tool can be quite useful for understanding a little about the field you work in. Enter some common search terms that you think would be logical variations on a topic, and see how they stack up against each other.
No doubt people can argue long and hard about the selection of search terms that you want to combine to examine trends. In this case, I have taken the term “data management” to see how much it is used in comparison with two related subjects, “GIS” and “databases”. While in many ways they cover slightly different aspects, there is also a good degree of common ground. People who look for database terms are involved in a form of data management or development, and GIS users are perhaps some of the most prominent users of data that require careful management of the source data.
I limited the search to the UK, as I was mainly interested in looking at the development over time. I was half expecting a steady increase in the GIS term, given many more users, and increasing legal requirements surrounding spatial data and representation. However, when the three terms are combined, GIS is quite stable, while the data management term grew in the early years, and then stayed relatively stable over the past 2-3 years. It is interesting to see that data management was not on the minds of many people prior to 2005, so it is indeed still a young field.
BODC’s New data catalogue
by Jens on Nov.24, 2010, under Online Data Sources
The British Oceanographic Data Centre has just announced a new facility on their website to search and retrieve data series directly from the web. While a lot of data could be retrieved before, this catalogue truly opens up access across all categories and projects, with over 76,000 data series being put online in a searchable format. The series are mainly CTD casts, but also include bathymetry, meteorology, optical properties, wave data and more.
The great thing is that data is available in several recognised formats, NetCDF, ODV and ASCII files – so virtually everyone in the field can access this data in a preferred format.
There are some limitations in terms of the way you can refine searches, but most of them make sense from the perspective of optimising searches and not hanging up the server in searches that return virtually everything.
Once you have narrowed your search criteria to return 1,000 series or fewer, you can retrieve results. There’s the option of downloading a KML file of coverage, and you can retrieve data in your preferred format.
It is important to note that we’re talking data series, not individual points here, so even a single series can contain thousands of data points, giving you access to a seriously large amount of oceanographic data with a wide geographic coverage.
The initial map on the start page shows waters around Britain, but make sure you either zoom out or pan around as there is data from a much wider region – virtually all of the world – than what is shown on the map.
You do have to register an account with BODC in order to check out your “data shopping”, but there is a huge amount of data freely available. The map tells you up front which data series are freely available.
BODC has truly made their data a lot more accessible with this exercise.
Ontology Part 4: Digging a bit deeper
by Jens on Nov.03, 2010, under General, Online Data Sources
- Syntactic Challenges – e.g. different models and languages
- Schematic Challenges – e.g. structural differences
- Semantic differences – e.g. different meanings and understandings.
Get INSPIRE’d with new presentations
by Jens on Oct.12, 2010, under Data Management, Legal, Online Data Sources
As the deadlines for the first batches of INSPIRE Annex I metadata are approaching in December, people are waking up to the work ahead. The UK Location Programme has been working hard at developing solutions, and preparing information for anyone needing to meet the INSPIRE directive in making spatial data available and accessible.
The latest culmination was a Data Providers workshop, where a set of presentations laid out what will be expected in terms of delivering data, and what tools are going to be made available to help this work. The presentations are now online, and I’d recommend anyone finding themselves having to publish spatial data under INSPIRE to have a look at these – they are a great introduction to the broad concepts, while there is also a good amount of detail for the more technically minded – helping to make decisions on the use of these tools or other services.
Ontology Part 3: Sharing it
by Jens on Sep.22, 2010, under Data Management, Online Data Sources, Software
In the last couple of posts I have been talking about ontology tools. In the meantime, I have been working a bit with Protege, getting a basic skeleton up (50 entities/classes, a similar number of instances, a handful of object relations, and some associated data fields). Now, however, I have reached the point where I need to start sharing it with a group of colleagues. Not everyone in this group will be au fait with running Protege and delving into the bowels of OWL files.
At first I looked at the simple exporter function, OWLDoc, which will drop your ontology in plain HTML and which will probably end up being the basic option for starters. It isn’t pretty without at least some CSS work done to it, but it still saves you explaining aspects of new software to someone who really should be contributing their knowledge and expertise about the subject in the ontology, not becoming a full-time editor.
There is also a neater Apache Tomcat servlet, ontology-browser, which seems to work very nicely. The slight caveat there is that I don’t think we have a spare server lying around for running it on. Remote hosting could be an option, but is not ideal, given some of the nature of information that may end up in the ontology.
Finally, I stumbled across a presentation on SlideShare, talking about implementation of semantic import into Drupal. This has gotten me rather excited as I am a long-time Drupal user, and like the extensibility of the system. There are already ideas buzzing around on the potential power of combining importable ontologies directly with web-based presentation material of different instances. I have yet to try it, and it raises some similar server issues to the ontology browser – but I still thought I would share the presentation here.
Integration in Taxonomy
by Jens on Aug.26, 2010, under Data Management, Marine Life, Online Data Sources
My background was in zooplankton ecology before I walked down the data management route. As such I still keep an eye on things in this area, and the ICES Study Group on Integrated Morphological Taxonomy (SGIMT) has recently released its report, which recommends standardising marine taxonomic nomenclature. No big surprises there – it is an area that we’re all too familiar with – there are lots of areas where things should be controlled better – but you have inherited a system with 10,000 old versions of names and there is neither the time nor the money to update it all.
However, the World Register of Marine Species (WoRMS) includes a Taxon Match facility, which will match up your list of species names with the authoritative list and provide you with additional reference information, including ITIS codes (TSN), Aphia ID, authorities, kingdom, phylum etc., which gives you a good chance of restructuring and updating older lists which may have drifted. I tried it out on a list of approximately 7,000 species of phyto- and zooplankton (although you have to break it down into chunks of at most 1,500 records per match), and generally got about a 60-70% match. It’s pretty nice to have cleared up nearly 5,000 records for an hour’s work rather than a long and painful search of each individual line.
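Splitting a long list into chunks that fit under the 1,500-record limit is easy to script. The following is only a minimal sketch in Python: the function names `clean` and `batches` are my own, the species names are placeholders rather than real data, and uploading each chunk to the Taxon Match page is still a manual step that is not shown here.

```python
from typing import Iterator, List

MAX_BATCH = 1500  # per-match record limit mentioned in the post


def clean(names: List[str]) -> List[str]:
    """Collapse stray whitespace, drop blanks and exact duplicates, keep order."""
    seen = set()
    result = []
    for raw in names:
        name = " ".join(raw.split())
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result


def batches(names: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `names`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


# Placeholder list standing in for a real species list (hypothetical names).
species = [f"Species {i}" for i in range(7000)]
chunks = list(batches(clean(species)))
print(len(chunks), [len(c) for c in chunks])
```

With 7,000 unique names this gives five chunks, four of 1,500 and one of 1,000. Cleaning out duplicates and blank lines first keeps the chunk count down and makes the match percentages easier to interpret.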
OBIS Seamap
by Jens on Aug.23, 2010, under Marine Life, Online Data Sources
OBIS (Ocean Biogeographic Information System), originally established under the Census of Marine Life, is an impressive alliance of people working to make biogeographic data available. As a whole, they now hold well over 27 million records and 849 data sets, which are accessible through the portal.
However, I thought I’d like to highlight a particular aspect – the OBIS Seamap. It includes observations on marine mammals, seabirds and sea turtles, as well as accessing environmental variables. In addition, there are links to a wide range of tools and additional databases ranging from photographic fin matching to sea turtle nesting sites.
It is the functionality of this site that is really impressive, though. Beyond the ability to search through over 2 million observations by data set, species, location etc., you are also able to extract all the relevant information directly to freely available mapping tools such as Google Earth (export a KML file to work on, based on your search results) as well as OGC-compliant formats for web mapping or file services. Altogether, the strong presentation of data sets combined with a well laid out and thought out set of functionalities demonstrates a very competent site, which will hopefully serve as inspiration to others looking to publish large volumes of marine data online.
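Once you have a KML export, pulling the point locations back out for your own use takes only a few lines. The sketch below uses Python's standard library and makes several assumptions: that the file uses the KML 2.2 namespace, that observations are stored as simple `Placemark` elements with a `Point`, and that the sample document and its names are invented for illustration. I have not checked this against the actual structure of a Seamap export, which may well differ.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; check the root element of your own export.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented sample document, not real observation data.
SAMPLE = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example sighting A</name>
      <Point><coordinates>-3.19,55.95,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Example sighting B</name>
      <Point><coordinates>-2.09,57.15</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def placemark_points(kml_text):
    """Return (name, latitude, longitude) for every Placemark with a Point."""
    root = ET.fromstring(kml_text)
    points = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        name = pm.findtext("kml:name", default="", namespaces=KML_NS)
        coords = pm.findtext("kml:Point/kml:coordinates", namespaces=KML_NS)
        if not coords:
            continue  # skip placemarks without point geometry
        # KML stores coordinates as longitude,latitude[,altitude]
        lon, lat = (float(v) for v in coords.strip().split(",")[:2])
        points.append((name.strip(), lat, lon))
    return points


for name, lat, lon in placemark_points(SAMPLE):
    print(f"{name}: lat={lat}, lon={lon}")
```

Note that KML puts longitude before latitude, which is an easy thing to trip over when moving points into a spreadsheet or GIS.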
