Research Data Management: Tidy Data

Tidy Data

Tidy Data refers to a structured and standardised way of formatting and organizing datasets that adheres to the principles of simplicity, consistency, and usability. It mainly relates to data stored in spreadsheets and databases.  

The main principles of Tidy Data are that each variable is stored in its own column, each observation is stored in its own row, and each value is stored in its own cell. This enables easier data manipulation, analysis, and visualisation, as it aligns with relational database principles and tools commonly used in data and statistical analysis.  
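
As a minimal sketch of these principles, the following Python reshapes a "wide" table (one column per year) into a tidy one where each row is a single observation. The table contents and the column names `site`, `year` and `count` are invented for illustration and are not taken from any real dataset.

```python
# Hypothetical "wide" table: one row per site, one column per year.
# The year columns hide a variable (year) inside the header row.
wide_rows = [
    {"site": "A", "2021": 12, "2022": 15},
    {"site": "B", "2021": 7, "2022": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Return one row per (id, column) pair, so each variable has its own column."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "site", "year", "count")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the tidy form, `site`, `year` and `count` are each a column, and every row is one count for one site in one year, which is the shape most analysis tools expect.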

Consider the following when setting up your data files:  

  • Don’t combine multiple pieces of information in one cell. Something may look like a single item, but consider whether that is the only way you will want to use or sort the data, e.g. use separate FirstName and LastName columns rather than a single ‘Name’ column.  

  • Always keep a copy of the ‘raw’ data, separate from your working files.  

  • Avoid formatting to convey information, e.g. bolding words, colour coding, adding comments to cells.  

  • Avoid merged cells.  

  • Export the cleaned data to a text-based format like CSV. A text-based file can be opened without specialised software, and CSV is the format required by most data repositories.  
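
Two of the points above, splitting a combined cell and exporting to CSV, can be sketched with Python's standard `csv` module. The sample rows and the column names are made up for illustration, and the simple split on the first space is an assumption that will not suit every name.

```python
import csv
import io

# Hypothetical raw rows with two pieces of information in the "Name" cell.
raw_rows = [
    {"Name": "Alex Smith", "Score": "91"},
    {"Name": "Jo Brown", "Score": "88"},
]


def split_name(row):
    """Split 'Name' at the first space into FirstName and LastName.

    Assumption: names have exactly two parts; real data needs a deliberate rule.
    """
    first, _, last = row["Name"].partition(" ")
    return {"FirstName": first, "LastName": last, "Score": row["Score"]}


# Build new rows rather than editing raw_rows, so the raw data stays untouched.
clean_rows = [split_name(row) for row in raw_rows]

# Write to an in-memory buffer here; a real script would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["FirstName", "LastName", "Score"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(clean_rows)
csv_text = buffer.getvalue()
print(csv_text)
```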

For more training on Tidy Data Principles, please see this online tutorial.  

In addition to effectively managing the data itself, file naming conventions and version control are crucial aspects of effective data management and collaboration in research settings.

File Naming Conventions

Your research will likely collect and create multiple files that require storage, across multiple file types (and versions). It is important to consider how to manage these files effectively.  

  • Consistent and descriptive file names facilitate easy organisation and retrieval of data, ensuring that files are logically grouped and easily identifiable.  

  • Clear and meaningful file names provide valuable context about the contents of the file, including its purpose, date, and any relevant identifiers, reducing confusion and errors.  

  • Well-structured file names streamline data management processes, saving time and effort in locating, accessing, and manipulating files during analysis or reporting tasks.  

Some tips for naming your files:  

  • No more than 25 characters long

  • No special characters (#, ~, >)

  • Use underscore (_) rather than space or full stops

  • Consistent dates (YYMMDD)

  • Include the file extension to identify the file type  
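
The tips above could be checked automatically with something like the sketch below. It rests on several assumptions that the tips do not spell out: the 25-character limit is taken to include the extension, only letters, digits and underscores are allowed before the extension, and any stand-alone run of six digits is treated as a YYMMDD date. The function name `check_file_name` is hypothetical.

```python
import re
from datetime import datetime

MAX_LENGTH = 25  # assumption: the limit counts the whole name, extension included
# Letters, digits and underscores, then a single dot and an extension.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")
# A run of exactly six digits is assumed to be a YYMMDD date.
DATE_PATTERN = re.compile(r"(?<!\d)\d{6}(?!\d)")


def check_file_name(name):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 25 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("contains spaces, special characters, extra full stops, or no extension")
    for match in DATE_PATTERN.finditer(name):
        try:
            datetime.strptime(match.group(), "%y%m%d")
        except ValueError:
            problems.append(f"{match.group()} is not a valid YYMMDD date")
    return problems


for candidate in ["240315_survey_v01.csv", "survey results #2.csv", "241345_survey.csv"]:
    print(candidate, "->", check_file_name(candidate) or "ok")
```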

Consider the following data points for inclusion:  

  • Date of creation (use consistent date formatting)

  • Name of creator

  • Descriptive data

  • Version number(s)

  • Project Number

  • Project Name  

Remember! It is more important to be consistent than to follow convention(s).
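
One way to stay consistent is to generate file names from the same data points every time instead of typing them by hand. The sketch below is only an illustration: the function `build_file_name`, the order of the parts, the use of creator initials and the two-digit version number are all assumptions, and it uses only a subset of the data points listed above.

```python
from datetime import date


def build_file_name(created, creator, description, version, extension):
    """Join date (YYMMDD), creator, description and version with underscores."""
    parts = [
        created.strftime("%y%m%d"),  # consistent YYMMDD date
        creator,
        description,
        f"v{version:02d}",  # zero-padded so v02 sorts before v10
    ]
    stem = "_".join(part.replace(" ", "_") for part in parts)
    return f"{stem}.{extension}"


file_name = build_file_name(date(2024, 3, 15), "JS", "survey", 1, "csv")
print(file_name)  # 240315_JS_survey_v01.csv
```

Adding a project number or name would follow the same pattern, though longer names quickly run into the 25-character guideline.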

Version Control

Version control systems (VCS) allow researchers to track and manage changes made to files over time, providing a comprehensive history of revisions, additions, and deletions. They facilitate collaborative work by enabling multiple users to concurrently edit files and merge changes, and they can aid with backup and recovery, reducing the risk of data loss due to accidental deletion, overwriting or hardware failure.  

VCS also enhance transparency and reproducibility in the research process, by providing an active account of changes made over time. Finally, they allow for branching of the research, without risking the integrity of the data.  

Consider the following:  

  • Include version control in your naming conventions  

  • Use automatic backup and synchronisation options whenever possible  

  • Where applicable, use systems with built-in version control, e.g. Git, Open Science Framework  

  • Back up and merge changes regularly  

  • Detail changes made to each version  

  • Always keep a copy of the ‘raw’ data  
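
Where a full version control system is not available, the first point above (version numbers in file names) can be applied by hand or with a small helper. The sketch below assumes a hypothetical `_vNN` suffix before the extension; the function name `next_version_name` and the sample file names are invented for illustration.

```python
import re


def next_version_name(existing_names, stem, extension):
    """Return 'stem_vNN.extension', numbered one above the highest existing version."""
    pattern = re.compile(rf"{re.escape(stem)}_v(\d+)\.{re.escape(extension)}")
    highest = 0
    for name in existing_names:
        match = pattern.fullmatch(name)
        if match:  # ignore files that belong to other series
            highest = max(highest, int(match.group(1)))
    return f"{stem}_v{highest + 1:02d}.{extension}"


existing = ["240315_survey_v01.csv", "240315_survey_v02.csv", "notes.txt"]
new_name = next_version_name(existing, "240315_survey", "csv")
print(new_name)  # 240315_survey_v03.csv
```

Saving each revision under a new name like this leaves earlier versions, including the raw data, untouched. It does not record what changed, so a short change note per version is still needed.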

For more information please see this guide.  

This work is licensed under CC BY-NC-SA 4.0