Author Archive
MEDIN – A way forward for UK Marine Science
by Jens on Nov.06, 2011, under Data Management
The Marine Environmental Data & Information Network has been on the scene for quite a few years now.
I have been involved through my work, and attended the recent open partnership meeting, and it struck me that the recent move into an “operational phase” seems to have kickstarted a great leap forward. There were many more organisations present at the partnership meeting than I think have been there previously. Crucially, there is a mix of public bodies (including NDPBs) and private companies now, and this is important as MEDIN is aiming to embed itself early into contractual agreements for the collection of Marine Data.
By embedding itself, I mean that MEDIN supplies a number of functions and standards that are useful for partners to include in data collection contracts. By including these, the contractor takes on an obligation to include metadata in the relevant format, and to stick to standards for certain types of data.
The process for including and utilising MEDIN is that there are several Data Archive Centres (DACs) which store data long term and make it available, initially through MEDIN’s discovery portal, but also with view and download services planned for many areas. While there may be costs associated with committing large data sets to an archive centre, there are actually a number of great benefits, which I think are not fully realised by partners or data suppliers:
- Data are stored securely and handled by experts.
- Data are made public (if so agreed).
- The partner does not incur the costs of storing, backing up and maintaining Master Data Management aspects of these data sets.
- Partners’ bandwidth costs are not escalated by starting to publish data.
- Data become part of a UK-wide network of reporting data.
Did increased computing power break data management?
by Jens on Sep.19, 2011, under Data Management
At the risk of the title already having put you off, and of this post being labelled as lamenting progress over the “good old days”, I am simply offering a few observations.
As part of a data rescue project, I’m currently reading the documentation for a plankton database system from 1994. The documentation pretty much covers everything you would expect to find in more modern documentation, e.g. a data model, work flows, methods and background code. It was designed to run on an old VAX system, and of course there are some technical limitations that I am glad we have moved past.
There are very clear limits on what could be put into tables, and on the number of tables that were created. This is obviously a result of having to be careful about referencing. E.g. you did not reference long species names when an identifier was available, as it would slow your system to a crawl.
However, looking at how very neatly data are structured in this old system, and looking to some of the more “modern” systems, it is actually far easier to re-extract and structure this data. All of the referential structure is intact even though the software itself is long gone, and I am left with nothing but the raw data. Looking at the raw data dump of many modern databases leaves something to be desired (IMO). The strict control of field formatting often goes out the window, and the “logic” of a system is frequently moved almost entirely to the front end, meaning that the DB itself sometimes doesn’t even contain all the referential information required.
Obviously you get what you pay for, and this is far from the case in many modern systems, but with the increased speed and power of computers, it has become ever so much easier to throw clock cycles at a workflow problem rather than going back to the root causes.
Given the increase in computing power, I’m not sure if we have seen an equivalent increase in performance of databases. Granted we can store much, much more in there, and in a much wider range of formats, but does it encourage the cutting of corners rather than improvement of data models?
I’m happy to be proved wrong, and I suspect there isn’t a right or wrong, but there is certainly an observation that when you are forced to be economical with your clock cycles, there seems to be a higher attention to your data model.
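As a rough illustration of what keeping the referential information in the database itself can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (species, sample_count, abundance) and the sample values are made up for the example and are not taken from the 1994 system. Counts reference a short species identifier rather than the long name, and the database, not the front end, rejects rows that break the rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical two-table model: a lookup table of species and a table of
# counts that refers to species by identifier, never by name.
conn.executescript("""
CREATE TABLE species (
    species_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE
);
CREATE TABLE sample_count (
    sample_id  INTEGER NOT NULL,
    species_id INTEGER NOT NULL REFERENCES species(species_id),
    abundance  INTEGER NOT NULL CHECK (abundance >= 0),
    PRIMARY KEY (sample_id, species_id)
);
""")

conn.execute("INSERT INTO species VALUES (1, 'Calanus finmarchicus')")
conn.execute("INSERT INTO sample_count VALUES (100, 1, 42)")


def insert_is_rejected(sql):
    """Return True if the database itself refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A count for a species that is not in the lookup table is refused.
unknown_species_rejected = insert_is_rejected(
    "INSERT INTO sample_count VALUES (100, 99, 5)"
)
# So is a value that breaks the field rule.
negative_count_rejected = insert_is_rejected(
    "INSERT INTO sample_count VALUES (101, 1, -3)"
)

# The long name is only joined back in when it is actually needed.
rows = conn.execute("""
SELECT c.sample_id, s.name, c.abundance
FROM sample_count AS c
JOIN species AS s USING (species_id)
""").fetchall()
```

Because the constraints live in the schema, a raw dump of these two tables would still carry the full structure even if whatever application sat on top of them disappeared.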
And now for something completely different – Timelapse of Earth
by Jens on Sep.19, 2011, under General, Uncategorized
This post is way out in left field compared with the normal topics on this blog, but this timelapse video of Earth as seen from the International Space Station was so impressive I just had to share it!
XML, Metadata, and setting up for harvest
by Jens on Jul.29, 2011, under Data Management, Online Data Sources
In rural Scotland, it’s the time of year when harvesters start to be a common sight in the fields. But I am setting up for a different kind of harvest – the harvesting of my organisation’s metadata.
TechTalk: WinMerge makes comparing files a lot easier
by Jens on Jun.25, 2011, under Software
Pardon me for a deviation into application-specific topics, but as a working data manager in the middle of a massive migration of content between systems, I had to share my recent admiration for the Open Source WinMerge product.
Moving files around is a task that at face value seems incredibly easy, but when you scale it to many servers in terabyte volumes, and add permissions for hundreds of groups to the mix, it starts becoming a bit of a job.
The first part was the straightforward bit of getting things copied, and then synced – Robocopy is the workhorse here – but ultimately, you will probably want to make sure everything got included in those scripts, and this is where WinMerge comes in. The product has several modes of comparison, from very fast size comparisons between two sets of folders to full binary comparison of all content. This is a major strength when working at large scale, as you can run fast scans and home in on any issues this way.
It also allows you to fluently merge across from one set of folders to the other if you find problems, but I mainly use it as a diagnostic tool – and so far, I’ve not found anything near it in user friendliness or performance.
The fact that it is an open source (GPL) tool only makes me happier – it is great to see collaborative projects that really nail some of the vacuums in the market!
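For anyone curious about the idea behind the quick scan followed by a full comparison, it can be sketched in a few lines of Python. This is only an illustration of the general technique and not how WinMerge itself is implemented; the function names and the report layout are my own invention, and the full comparison here uses SHA-256 digests as a stand-in for a byte-by-byte check.

```python
import hashlib
import os


def relative_files(root):
    """Map each file's path relative to root to its size in bytes."""
    sizes = {}
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source, target, full=False):
    """Compare two folder trees by file presence and size.

    With full=True, files whose sizes match are also compared by content.
    """
    src, dst = relative_files(source), relative_files(target)
    report = {
        "missing_in_target": sorted(set(src) - set(dst)),
        "missing_in_source": sorted(set(dst) - set(src)),
        "size_differs": [],
        "content_differs": [],
    }
    for rel in sorted(set(src) & set(dst)):
        if src[rel] != dst[rel]:
            # Different sizes already prove a difference; no need to read.
            report["size_differs"].append(rel)
        elif full and (
            file_digest(os.path.join(source, rel))
            != file_digest(os.path.join(target, rel))
        ):
            report["content_differs"].append(rel)
    return report
```

Calling compare_trees(old_root, new_root) gives the fast pass that only looks at names and sizes; adding full=True reads every matching file, which is much slower but catches files that are the same size and still differ.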
Data Qualia
by Jens on Mar.02, 2011, under Data Management, Online Data Sources
First off – no, I’m not so rushed that the heading is a typo.
It is instead the title of a short but interesting blog post by Jim Harris over at the OCDQ blog.
In this post, Jim covers how the word “Qualia” is used to describe the subjective quality of conscious experience, and how very, very relevant it is in terms of data quality management.
Anything above and beyond the most basic data quality checking is really in the eye of the beholder. One of the common memes of DQ is that data should be “fit for purpose”. Although fitness of data is important, it is the qualia of the purpose, more than the quality, that raises issues, especially when it comes to sharing data. Trying to predict the purpose of someone accessing data and ensuring that the data are fit for this purpose is virtually impossible. Yet it is an issue that becomes important when organisations and public authorities may otherwise have to include lengthy licenses to remove doubt about liability.
So perhaps the term Data Qualia is one we will see more often in the future.
Getting work done: Anywhere but the office
by Jens on Feb.11, 2011, under General
In his TED talk, Jason Fried pretty much nails it – why we don’t manage to get work done at work!
Managers and meetings are the prime source of interruptions and artificial divisions that mean you never make it through a full “work-cycle”, he explains in the talk. There is a lot of truth in this, in that meetings very often tend to generate two things: more meetings and changed objectives.
So why am I even including this on a data management blog? Well, for starters, most data managers work in some kind of office with some colleagues, and quite likely can be the source of the two major evils to other workers: management and meetings. As people with the word “management” directly in the title, I guess we are perhaps just native interruptions to other workers in an organisation, and the need to call meetings to get combined strategies and systems in place pretty much makes us walking interruption bombs if we’re not careful.
So what can we do in terms of getting good management of data, but without being a constant source of interruption to others? Well, I think that the job of a data manager is to work as a conduit, translating needs, and bridging the technology and business goals of the organisations we work for. Just like everyone else, we’ll need periods of uninterrupted work to be creative, and to come up with good solutions – but we also need to respect others’ need for the same. So it’s a two-way street – reduce your own interruptions, and be careful about interrupting others. Perhaps use email or IM for something that does not require a room full of people for two hours to reach consensus on. Perhaps consider that the door shouldn’t always be open. It’s good to be approachable – but you need time to concentrate as well.
I think preparation is key as well. When it finally comes to having some of the meetings that simply are necessary – make sure you are prepared. Identify what you need to do at the meeting before you sit in the conference room! Don’t let anyone run away with the agenda or hijack the meeting – make a plan and stick to it. There are meetings that are productive – but those are the ones where people walk out afterwards and have a clear idea of what was achieved, and who is doing what. Those kinds of meetings take time to prepare for – uninterrupted time.
Discussing Data: LinkedIn Groups
by Jens on Feb.02, 2011, under Data Management, General, Online Data Sources
Data Managers benefit from discussing with peers, just like in any other trade – and perhaps even more so because of the rapid pace of technological development. However, as cash is tight during this financial downturn, many may find travel options limited and will turn to the web instead of (or in addition to) personal face-to-face meetings.
There are of course many, many places on the web where you can start discussing, but I often find that locating an active community can be a bit of a challenge. LinkedIn is a thriving community with many active groups, mainly focussed on the professional aspects of our lives. The benefit of some of these groups is that they include real life professionals with clear credentials, and have groups where people actively debate issues that are close to the hearts of data managers, ranging from how best to implement quality management through to privacy and ownership issues.
In this post, I have gathered a short list of some of the groups I enjoy reading and participating in. Note that you will have to request to join most of them – but it helps reduce the amount of spam and keeps the discussion focussed amongst people with relevant interests. I hope you might find it useful, and if you have other sites/communities or groups, please do share.
The Data Ownership in the Cloud group on LinkedIn is a global venue for multi-disciplinary networking between technologists and non-technologists interested in providing thought leadership on this critical issue.
IAIDQ Information/Data Quality Professional Open Community
LinkedIn.iaidq.org is an open community for Information Quality, Data Quality and Data Governance professionals (practitioners, consultants, academics, vendors etc.) to support collaboration, learning, networking and interaction in a vendor neutral format.
Obsessive-Compulsive Data Quality (OCDQ)
This is a networking group for Obsessive-Compulsive Data Quality (OCDQ), which is an independent blog offering a vendor-neutral perspective on data quality.
Vendor-neutral does not mean no vendor related content. When the products and services of vendors are presented or discussed, it will be done in an objective manner.
The goal is to foster an environment in which a diversity of viewpoints is freely shared without bias. Everyone is invited to get involved in the discussion and have an opportunity to hear what others have to offer.
Talend Open Source Data Integration
Talend is the first provider of open source data integration solutions, used primarily for ETL for business intelligence/data warehousing, data synchronization, data migration, operational data integration, data quality and MDM.
This group enables Talend users (and friends) to share information, news, discussions, and ideas about their Talend projects – and anything else of interest for the Talend community!
The purpose of this group is to bring together a professional group of individuals to collaborate about issues, problems, situations within the data space.
We all know that the world of enterprise data is sometimes hard to control and understand but together we can make a difference and learn from one another’s experiences. Whether you’re dealing with Oracle, SAP, IBM, etc. it doesn’t matter we still share some of the same pain points.
This group encourages everyone to share their lessons learned, thought leadership ideas, best practices, etc. that will educate and promote the development of world class solutions.
Habitats – (Social Validation of INSPIRE Annex III Data…
Habitats (Social Validation of INSPIRE Annex III Data Structures in EU Habitats) related spatial data is critical in the management of Europe’s bio-diversity. INSPIRE needs work here, particularly in its Annex III data themes: 16-18 (Sea regions, Bio-geographical regions, Habitats & biotopes, Species distribution).
Data Management as a search term
by Jens on Jan.13, 2011, under Data Management, Online Data Sources
When going out on the World Wide Web, searching for solutions, discussions, news, and just about everything else relating to a topic we might be interested in, we leave a little footprint behind. These footprints of what has been searched for can at times be interesting to study. It is not just newspapers that, in the lead up to the New Year, use these tools to report on the most sought-after words on the web; they can also give you a little insight into topic areas and how they are developing.
Google Trends provides an option for combining multiple search terms and looking at their trends, either on a global or national basis (or even by sub-region for very popular search terms). This tool can be quite useful for understanding a little about the field you work in. Enter some common search terms that you think would be logical variations on a topic, and see how they stack up against each other.
No doubt people can argue long and hard about the selection of search terms that you want to combine to examine trends. In this case, I have taken the term “data management” to see how much it is used in comparison with two related subjects, “GIS” and “databases”. While in many ways they cover slightly different aspects, there is also a good degree of common ground. People who look for database terms are involved in a form of data management or development, and GIS users are perhaps some of the most prominent users of data that require careful management of the source data.
I limited the search to the UK, as I was mainly interested in looking at the development over time. I was half expecting a steady increase in the GIS term, given many more users, and increasing legal requirements surrounding spatial data and representation. However, when the three terms are combined, GIS is quite stable, while the data management term developed in the early years, and then stayed relatively stable over the past 2-3 years. It is interesting to see that data management was not on the minds of many people prior to 2005, so it is indeed still a young field.
BODC’s New data catalogue
by Jens on Nov.24, 2010, under Online Data Sources
The British Oceanographic Data Centre has just announced a new facility on their website to search and retrieve data series directly from the web. While a lot of data could be retrieved before, this catalogue truly opens up access across all categories and projects, with over 76,000 data series being put online in a searchable format. The series are mainly CTD casts, but also include bathymetry, meteorology, optical properties, wave data and more.
The great thing is that data is available in several recognised formats (NetCDF, ODV and ASCII files), so virtually everyone in the field can access this data in a preferred format.
There are some limitations in terms of the way you can refine searches, but most of them make sense from the perspective of optimising searches and not hanging up the server in searches that return virtually everything.
By the time you have narrowed your search criteria to return 1,000 series or less, you can retrieve results. There’s the option of downloading a KML file of coverage, and you can retrieve data in your preferred format.
It is important to note that we’re talking data series, not individual points here, so even a single series can contain thousands of data points, giving you access to a seriously large amount of oceanographic data with a wide geographic coverage.
The initial map on the start page shows waters around Britain, but make sure you either zoom out or pan around, as there is data from a much wider region – virtually all of the world – than what is shown on the map.
You do have to register an account with BODC in order to check out your “data shopping”, but there is a huge amount of data freely available. The map tells you up front which data series are freely available.
BODC has truly made their data a lot more accessible with this exercise.
