February 14, 2013
by Lindsay Wood
DataFlow project background:
DCC catalogue record: http://www.dcc.ac.uk/resources/external/datastage
Two tools
(a) DataStage, for researchers to manage their research data locally.
“DataFlow lets researchers save their work to a DataStage file system that appears as a mapped drive on their computer, a lightweight system requiring them to install no special software on their computers.”
More details: http://www.dataflow.ox.ac.uk/index.php/datastage/users/researchers
(b) DataBank, to preserve and publish valuable research.
“DataStage is a secure personalized ‘local’ file management environment for use at the research group level, appearing as a mapped drive on the end-user’s computer.”
More details: http://www.dataflow.ox.ac.uk/index.php/databank
Firstly, it’s great that the DataFlow team have released this system openly for re-use. Below are some of our findings.
From a local technical infrastructure assessment:
Ubuntu is not our standard Linux platform (which is Red Hat/CentOS). It would almost certainly be possible to port the DataFlow packages to CentOS (and feed this back to the main project), or to use Ubuntu as an appliance (though the systems involved would then not be managed by our standard configuration system). Either option comes with a reasonably significant cost.
The feeling that we got from installation (testing prior to 24 July 2012) is that the system is in the early stages of its lifecycle, and our assessment is that DataFlow is not yet of sufficient maturity to deploy in production at Newcastle. It would be worth re-evaluating this decision at a later time. That re-evaluation would be prioritised according to feedback from end users who have tried the system, i.e. the more they liked it, the more worthwhile it would be to put resources into trying it again and working with the DataStage developers.
In terms of initial user testing (in early August 2012, on the ‘v0.3.1rc2’ Oxford installation), initial feedback was:
User testing – DataStage:
Users liked the feature specification of what it offered as a tool: desktop integration through mapped drives, web access aiding working from home, no need for a designated computer for their research work, the setting of different access rights (private, public, and collaborative) and the ‘invite to share’ options. The system interface is fine, basic yet functional, and could be ‘skinned’ to the institutional brand. Uploading documents/data files is straightforward.
My opinion was that if an institution had no existing RDM systems, it would be a very useful ‘bootstrap’, providing a simple functional system.
Seamless integration of a data file staging system/VRE with the user desktop (ideally through ‘drag & drop’/mapping over existing user networked drives) and through web access are key features that are top of an ‘average’ researcher’s wish list.
Making sure research data sets can be given an appropriate level of metadata in ‘data staging’ RDM tools (or perhaps later in the lifecycle, as practical?), so that metadata can flow through to an eventual data catalogue or national repository, is an important RDM requirement. It is therefore important to flag that this function is provided to researchers, and DataStage/DataBank are a good approach to this.
I thought more data file re-use metadata capture would have been an option in DataStage (noting the manifest/Zip package upload feature), pulled in automatically from the individual data file itself, ahead of the DataBank stage (though that’s probably me being simplistic on the technical aspects?).
We noted that not all users were comfortable with, or had success in, Windows drive mapping (network path errors), so some end user support would be needed. Users have high expectations on usability: ‘as easy as DropBox’.
Error messages encountered while testing: access forbidden, 505/405, and ‘submit as data package’, where an entered/saved password was looping(?). Customised error messages, such as ‘this problem normally occurs because of x, y or z – wrong password, wrong file path, etc.’ (rather than ‘Error 505’/‘Error 404’), would be more helpful.
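To illustrate the kind of customised error message suggested above, here is a minimal Python sketch. It is not DataStage code: the `FRIENDLY_HINTS` table, the `friendly_error` function and the wording of each hint are all hypothetical, and the likely causes listed are only the examples our testers had in mind (wrong password, wrong file path), not a diagnosis of the actual faults.

```python
# Hypothetical lookup table: HTTP status code -> plain-language hint.
# The hints are illustrative assumptions, not taken from DataStage itself.
FRIENDLY_HINTS = {
    403: "Access forbidden. This problem normally occurs because of a wrong "
         "password, or because you do not have permission for this area.",
    404: "Not found. This problem normally occurs because of a wrong file "
         "path or a mistyped drive address.",
    405: "Action not allowed here. This problem normally occurs when the "
         "area you are working in is read-only for your account.",
    500: "The server hit an internal problem. This is not caused by "
         "anything you did; please try again or contact support.",
}


def friendly_error(status_code: int) -> str:
    """Return a message that keeps the code but explains likely causes."""
    hint = FRIENDLY_HINTS.get(status_code)
    if hint is None:
        # Unknown codes still get a usable next step rather than a bare number.
        return (f"Error {status_code}: an unexpected problem occurred. "
                "Please contact your local support team and quote this code.")
    return f"Error {status_code}: {hint}"


if __name__ == "__main__":
    print(friendly_error(404))
    print(friendly_error(505))
```

Keeping the numeric code in the message means support staff can still look it up, while the researcher gets something to act on straight away.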
User testing – DataBank
Liked:
– Simple, clean functional interface – again could be ‘skinned’ to institutional brand.
– Current search/‘on-off’ filters were good
– DOI assignment and RDF were useful RDM-specific features.
– Licensing/embargo fields
– Simple admin interface
– CSV/JSON exports are useful
– REST API was documented
Suggestions:
– Clarifying who the intended user audience for DataBank was: researcher or archivist?
– Terminology not understood by user testers (‘Silo’, ‘Mediator’, ‘Aggregate’); obviously this could be changed easily.
– RDF and click-through access to the XML schema were confusing for our testers (they were not archivists, librarians or metadata experts, who would probably appreciate this function, i.e. package/manifest upload/explore)
– A basic tagging interface/fields to populate the RDF/XML for non-specialists would be more friendly
– Again, frequent error messages (404 Not Found, 500 Internal Server Error; ‘Add manifest’ gives 505)
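As a rough illustration of the tagging suggestion in the list above, the Python sketch below turns a handful of plain form fields into RDF/XML, so that a non-specialist never has to see the markup. This is an assumption-laden sketch, not DataBank’s implementation: the `tags_to_rdf_xml` function is hypothetical, the use of Dublin Core element names is my own choice for the example, and the dataset address is a placeholder.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and the Dublin Core element set.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def tags_to_rdf_xml(dataset_uri, fields):
    """Build an RDF/XML string from simple form fields (hypothetical helper).

    `fields` maps Dublin Core element names (e.g. "title", "subject")
    to a string or a list of strings, as a basic tagging form might supply.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": dataset_uri}
    )
    for name, value in fields.items():
        # Allow repeated tags, e.g. several subject keywords.
        values = value if isinstance(value, list) else [value]
        for item in values:
            element = ET.SubElement(description, f"{{{DC_NS}}}{name}")
            element.text = item
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder dataset address and example tags.
    print(tags_to_rdf_xml(
        "http://example.org/datasets/1",
        {"title": "Soil samples", "creator": "A. Researcher",
         "subject": ["soil", "pH"]},
    ))
```

The researcher only fills in fields such as title, creator and keywords; archivists and metadata specialists could still inspect or extend the generated RDF/XML afterwards.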
Documentation for DataStage/DataFlow researcher end users:
User documentation for researchers seemed a little sparse (I think the project/developers noted it is a work in progress, i.e. https://github.com/dataflow/RDFDatabank/wiki). More end user documentation would facilitate wider take-up. To note, technical installation documentation was more detailed, with screen shares, etc.
We look forward to further DataFlow project developments.
DataFlow user forum is at: https://groups.google.com/forum/?fromgroups=#!forum/dataflow-users