Data organization is about working more efficiently with data. Creating and using data always requires some level of organization. When organizing data by hand becomes time-consuming and error-prone, automated data organization methods should be considered.
Researchers collaborating on projects will often need to share primary data and preliminary results; hence, it is often necessary for them to transfer data between computer systems. Researchers may also wish to transfer data stored on their university computer from outside the university, such as when overseas.
The most common method for transferring files is by email attachment, but there are limits on the size of files that can be sent this way. Removable data storage media, such as USB keys, CDs, or DVDs, can hold large amounts of data, but require the researcher to physically carry the data to its destination.
Large files are usually transferred using FTP (File Transfer Protocol). FTP allows the user to download as well as upload. Access to files can be restricted by username and password.
An FTP client (such as FTP Explorer) is used to connect and transfer files, although most web browsers can access FTP servers by entering the URL in the location bar with http replaced by ftp. The scientific community increasingly uses the SSH File Transfer Protocol (also known as Secure File Transfer Protocol, Secure FTP, or SFTP).
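As a rough illustration of what an FTP client does behind the scenes, the sketch below uploads and downloads a single file with Python's standard `ftplib` module. The helper names `upload_file` and `download_file`, the host `ftp.example.org`, and the credentials in the usage note are assumptions made for illustration only; they do not refer to any real ANU service. The standard library has no SFTP client, so this sketch covers plain FTP only.

```python
from ftplib import FTP  # used in the usage note below
from pathlib import Path


def upload_file(ftp, local_path, remote_name=None):
    """Upload one local file over an already connected, logged-in FTP session."""
    local_path = Path(local_path)
    # Default to the local file name if no remote name is given.
    remote_name = remote_name or local_path.name
    with local_path.open("rb") as handle:
        ftp.storbinary(f"STOR {remote_name}", handle)
    return remote_name


def download_file(ftp, remote_name, local_path):
    """Download one remote file and write it to local_path."""
    local_path = Path(local_path)
    with local_path.open("wb") as handle:
        # retrbinary calls handle.write once for each block received.
        ftp.retrbinary(f"RETR {remote_name}", handle.write)
    return local_path
```

A session might then look like this (hypothetical host, username, and password):

```python
with FTP("ftp.example.org") as ftp:
    ftp.login(user="researcher", passwd="secret")
    upload_file(ftp, "results.csv")
    download_file(ftp, "results.csv", "results_copy.csv")
```

Passing the connection in as an argument keeps the helpers independent of how the session is opened, so the same functions would work with `ftplib.FTP_TLS` where the server supports it.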
To support good data management, the ANU provides local area network and Internet access to the Homedrive, a central file store on which each member of the University is allocated space for personal files (4.5GB for students and staff). Homedrive is accessible from any Information Commons computer.
One of the most popular file synchronization programs is WinSCP, which is primarily for SSH and FTP transfers, but can also synchronize data.
rsync is another widely used open-source utility for incremental file transfer and synchronization. It is cross-platform and can be used to generate 'snapshots' and regular backups.
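The idea of an incremental transfer can be sketched in a few lines of Python: copy only the files that are missing from the destination or that look changed. This is a deliberately simplified stand-in, not rsync's own algorithm (rsync can also transfer just the changed parts of a file and work over a network). The function name `sync_tree` and the size-and-timestamp check are assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def sync_tree(src, dst):
    """Copy new or changed files from src to dst; return the relative paths copied.

    A file counts as unchanged when the destination copy has the same size
    and is not older than the source (compared to the whole second).
    Files deleted from src are NOT removed from dst.
    """
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(src)
        target = dst / relative
        if target.exists():
            s, t = path.stat(), target.stat()
            if s.st_size == t.st_size and int(s.st_mtime) <= int(t.st_mtime):
                continue  # looks unchanged, so skip it
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the modification time
        copied.append(relative.as_posix())
    return copied
```

Running `sync_tree` a second time on an unchanged folder copies nothing, which is what makes regular backups of a large data directory cheap. An edit that keeps the file size the same and lands within the same second as the previous copy would be missed by this check, which is one reason to prefer a dedicated tool for real work.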
The use of dedicated Version Control Software (see page 11) is another option for file synchronization.
Commercial, user-friendly file synchronization services are becoming increasingly popular. Dropbox, for example, provides 2GB of data storage for free and also provides good user management tools to support collaborative work.
Many research projects are carried out collaboratively: between postgraduates and their supervisors; within departmental research groups; across disciplines; and between universities.
Collaboration is mutually beneficial: it improves access to funding, avoids repeating costly experiments, increases recognition through co-authorship, and can lead to new research ideas.
For simple tasks, collaboration usually means transferring data by email, USB key, or a network drive. Publications with multiple authors are often written this way: authors take turns editing the document and emailing it to their colleagues, or the primary author periodically emails the latest version and colleagues reply with corrections and additions.
These methods are adequate for simple work with only a small number of collaborators. For anything larger, it is worth considering collaborative software tools, such as the ANU-provided Alliance, version control software, or both.
Alliance is a web-based service that allows ANU staff and students to easily set up collaborative project sites. Alliance provides a wide range of collaborative tools such as forums, chat rooms, calendars, and more.
Sakai is the underlying software for Alliance.
When data is constantly being edited, especially by multiple users, it is a good idea to implement some form of version control to keep track of changes. This can be as simple as appending a number to the end of the file name after each major edit. For example:
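As a minimal sketch of such a numbering scheme, assume a hypothetical convention in which `_v` and a number are placed just before the file extension. The function name `next_version` and the file names shown are illustrative only and are not part of any ANU tool.

```python
import re
from pathlib import Path


def next_version(filename):
    """Return the file name to use for the next major edit.

    An unnumbered name gets "_v1"; a numbered name has its number increased
    by one, keeping any zero padding. Only the file name is returned, without
    any directory part.
    """
    path = Path(filename)
    match = re.fullmatch(r"(.*?)_v(\d+)", path.stem)
    if match:
        base = match.group(1)
        width = len(match.group(2))       # preserve padding such as "09"
        number = int(match.group(2)) + 1
    else:
        base, width, number = path.stem, 1, 1
    return f"{base}_v{number:0{width}d}{path.suffix}"


print(next_version("report.doc"))      # report_v1.doc
print(next_version("report_v1.doc"))   # report_v2.doc
print(next_version("report_v09.doc"))  # report_v10.doc
```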
Such conventions are good for simple work but quickly become unmanageable when you have multiple authors or make lots of edits.
The alternative is to use revision (or version) control software. These programs are used extensively for software development but are also excellent for documentation, such as writing a paper with several authors. Version control software also provides access control, a collaborative work environment, synchronization between home/office/laptop computers, and a degree of data safety (although not as good as proper backups).
Such programs offer several advantages:
Although the time required to learn the software may seem like a drawback, learning it is highly recommended, because it avoids the recurring problems of version control based on file names alone.
TortoiseSVN, for example, is a popular program that uses the Subversion system of version control. It integrates with Windows Explorer, making it one of the easiest version control systems to use.
While version control software is in some cases harder to set up, it provides more advanced version tracking. A distributed version control system like Bazaar can be used with Alliance to collaboratively manage documents and data.
Such tools also make it easier for any number of people to work on a document or code. They are more efficient too, as everyone has access to the latest version and can make edits without conflicting with other people’s changes. The entire history of the document is stored, making it easier to revert to an older version and for users to see what has changed since they last looked at the data.
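The two ideas in the last sentence, a stored history and the ability to see what changed, can be illustrated with a toy in-memory model in Python. It is a sketch for explanation only and is not how Subversion or Bazaar are implemented; the class `DocumentHistory` and its method names are hypothetical.

```python
import difflib


class DocumentHistory:
    """Toy revision store: every commit keeps the full text of the document."""

    def __init__(self):
        self._revisions = []  # list of (author, message, text)

    def commit(self, author, message, text):
        """Record a new revision and return its number (starting at 1)."""
        self._revisions.append((author, message, text))
        return len(self._revisions)

    def text_at(self, revision):
        """Return the document text as it was at the given revision."""
        if not 1 <= revision <= len(self._revisions):
            raise IndexError(f"no such revision: {revision}")
        return self._revisions[revision - 1][2]

    def log(self):
        """List (revision, author, message) for every commit, oldest first."""
        return [(n, author, message)
                for n, (author, message, _) in enumerate(self._revisions, start=1)]

    def diff(self, old_revision, new_revision):
        """Return unified-diff lines showing what changed between two revisions."""
        return list(difflib.unified_diff(
            self.text_at(old_revision).splitlines(),
            self.text_at(new_revision).splitlines(),
            fromfile=f"r{old_revision}", tofile=f"r{new_revision}", lineterm=""))

    def revert_to(self, revision, author):
        """Restore an older text by committing it again, so no history is lost."""
        return self.commit(author, f"Revert to r{revision}", self.text_at(revision))
```

A co-author returning to a paper could call `diff(3, 7)` to see everything that changed between the revision they last read and the current one, while `revert_to` shows why reverting is safe: the older text becomes a new revision rather than overwriting the ones in between.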