Correctly knowing and structuring your data matters if you want to get to grips with advanced data analytics.

Data is among the most critical assets of businesses today. It is a means to inform decisions, measure performance, gain insights and add value. But data that is not adequately prepared and validated can lead to misleading results.

Today, data comes in many more shapes and sizes, and it is becoming more common to deal with unstructured data as well as structured data. This trend is set to continue, and if it hasn’t crept into your world yet, it almost certainly will over the next few years. This quote from Christie Schneider of IBM (May 2016) sums it up:

“There has been a paradigm shift in data growth. From mostly structured, and not too much of it, to mostly unstructured, and a lot of it. Businesses use structured data every day through relational databases and spreadsheets, where patterns can easily be identified.

However, unstructured data, which comes in the form of emails, social media, blogs, documents, images and videos, represent a significant source of opportunity for businesses. Due to its unstructured nature, it is difficult for people to gain insight from it using conventional systems. And because so much of data created today is unstructured, organisations need to be able to understand what’s in this data, or risk missing out on significant amounts of digital intelligence.”

Depending on the source and researcher, unstructured data is estimated to account for somewhere between 80% and 90% of all data. That’s big data!

There has been significant growth in the practice of data visualisation in analytics, and informative, reliable visualisations depend on well-prepared data. When setting out to answer a business question through data analytics, the dataset usually needs to be prepared first.

Data Types

Data types need to be understood so that, when we prepare data, we set each attribute correctly and can slice the data and perform calculations on it.

When you’re importing, exporting or preparing a database, do you give enough thought to the data types? It’s certainly worth getting this right, especially in the preparation stage. The common types are listed below, with a short sketch of setting them on import after the list:

  • Boolean – True or False, 1 or 0
  • Float – Numbers that can have decimal places (floating point)
  • Integer – Positive or negative whole numbers, no decimal places
  • String – Alphanumeric text
  • Character – A single letter, number or symbol
  • Date – A calendar date, sometimes combined with a time
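
As a minimal sketch of setting these types at import time, assuming a Python/pandas workflow; the file and column names here are hypothetical:

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    sales = pd.read_csv(
        "sales.csv",
        dtype={
            "invoice_number": "string",   # identifiers are text, not numbers
            "quantity": "Int64",          # whole numbers, nullable
            "unit_price": "float64",      # may carry decimal places
            "is_credit_note": "boolean",  # True/False, nullable
        },
        parse_dates=["invoice_date"],     # parse as a date, not a string
    )
    print(sales.dtypes)

Declaring types explicitly, rather than letting the tool guess, is what keeps identifiers from being mangled into numbers and dates from being left as text.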

Structuring data

Data can arrive in a structured or unstructured format. Structured data is formatted in columns and rows; you’ll be familiar with this structure from Excel, where data sits in rows 1 through N and columns A, B, C and so on. Unstructured data is unformatted; an example might be a raw text file or an unparsed CSV. You have to prepare the data by telling the software how you want it separated before you can use it for reporting and analysis.
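
For example, here is a minimal pandas sketch of telling the software how the data is separated, assuming a hypothetical pipe-delimited export:

    import pandas as pd

    # Hypothetical raw export: one record per line, fields separated by "|".
    df = pd.read_csv("export.txt", sep="|")
    print(df.head())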

Here’s a tip: a good discipline is to always structure your data in columns and rows, with each column having a header and each row being a unique transaction (e.g. an invoice) or part of a transaction (a line item of an invoice) with a unique identifier (transaction ID, invoice number, etcetera). If your database uses multiple separate tables for transaction data, link them by the unique identifier; that way, you can join all the individual tables together based on it, as in the sketch below.
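
A minimal sketch of that join, assuming a pandas workflow with two hypothetical tables keyed on invoice_number:

    import pandas as pd

    # Hypothetical header and line-item tables sharing a unique identifier.
    invoices = pd.DataFrame({
        "invoice_number": [1001, 1002],
        "customer": ["Acme Ltd", "Bravo plc"],
    })
    line_items = pd.DataFrame({
        "invoice_number": [1001, 1001, 1002],
        "item": ["Widget", "Gadget", "Widget"],
        "amount": [50.0, 20.0, 35.0],
    })

    # Join the separate tables back together on the identifier.
    combined = invoices.merge(line_items, on="invoice_number", how="left")
    print(combined)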

Maintain the structure of the data as closely as possible to how it exports from the source system. Ideally, you are exporting the data via an API straight into a database. That way you’re not amending the source data, so it keeps its integrity. You can create your calculated fields in the SQL database or (less preferably) in Excel. Either way, you’re reducing the manual copying and pasting of data into a tool like Excel, where it can be edited; as soon as that happens, the inconsistencies start to appear.
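
As a rough sketch of that downstream step, again in pandas; the file, column names, table name and SQLite database are all hypothetical stand-ins for your own source system and warehouse:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical export, kept exactly as the source system produced it.
    raw = pd.read_csv("export.csv")

    # Add calculated fields on a copy rather than editing the source data.
    enriched = raw.copy()
    enriched["net_amount"] = enriched["quantity"] * enriched["unit_price"]

    # Load into a database (SQLite here purely for illustration).
    engine = create_engine("sqlite:///analytics.db")
    enriched.to_sql("sales", engine, if_exists="replace", index=False)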

Start to think like a database engineer. You’ll find your databases become more consistent in their structure and more useable for analytics with fewer errors coming out at the end.


Common issues with data

When preparing data for analysis, reporting, or ad-hoc decision making, it’s essential to be aware of the common areas where data can cause errors or misleading results. Here are some of the main issues you might come across:

Null values

Cells or fields that contain no data can impact calculations. Search for null values during the data-cleaning process and determine whether they have an impact. For example, will you be performing a count on a column with nulls? How would you like your visualisation to show them?
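
One quick way to surface nulls, as a pandas sketch with a hypothetical file and column name:

    import pandas as pd

    df = pd.read_csv("export.csv")  # hypothetical file

    # Count nulls per column before deciding how to treat them.
    print(df.isna().sum())

    # count() ignores nulls, so it can understate the number of rows.
    print(df["amount"].count(), "non-null values in", len(df), "rows")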

Inconsistent values

Dates can arrive in inconsistent formats, such as UK (day/month/year) and US (month/day/year) dates in the same column. After importing data into Excel or a BI tool such as Tableau or Power BI, check that dates have parsed in the expected format.
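
A small pandas sketch of the fix, with hypothetical values; parsing with an explicit format (or dayfirst=True for UK data) removes the ambiguity:

    import pandas as pd

    # Two UK-format dates; an auto-parser could read the first value as
    # 4 March (US style) rather than 3 April (UK style).
    dates = pd.Series(["03/04/2020", "25/12/2020"])
    parsed = pd.to_datetime(dates, format="%d/%m/%Y")
    print(parsed)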

Schema

The shape of the dataset: the columns, the column names and their order. New data may have new columns, and pasting it over existing data may break formulas. Deleting columns or rows after importing also changes the schema.
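
A lightweight guard, sketched in pandas with hypothetical file and column names, is to validate the schema before the data flows into formulas or reports:

    import pandas as pd

    df = pd.read_csv("export.csv")  # hypothetical file

    # Fail fast if the shape of the dataset has changed.
    expected = ["invoice_number", "invoice_date", "customer", "amount"]
    missing = [c for c in expected if c not in df.columns]
    new = [c for c in df.columns if c not in expected]
    if missing or new:
        raise ValueError(f"Schema changed. Missing: {missing}. New: {new}.")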

Relevance of data

Is what you are bringing through relevant? If there are columns that don’t provide value, exclude them (see the sketch under ‘Size of data’ below).

Size of data

Too many columns can cause performance issues in data visualisation. Filter out what you don’t need.
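
One way to do this, sketched in pandas with hypothetical names, is to exclude unwanted columns at import time so they never enter the dataset at all:

    import pandas as pd

    # Read only the columns the analysis actually needs.
    cols = ["invoice_number", "invoice_date", "amount"]
    df = pd.read_csv("export.csv", usecols=cols)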

Outliers

Outliers are extreme values in the data. You can quickly identify them using distribution charts or a data prep tool.
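
If you’re working in code rather than a prep tool, a pandas sketch using the common 1.5 x IQR rule of thumb (file and column names hypothetical) looks like this:

    import pandas as pd

    df = pd.read_csv("export.csv")  # hypothetical file

    # Flag extreme values with the 1.5 x IQR rule of thumb.
    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
    print(outliers)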

Errors

In the data entry stage, it is possible that mistakes will occur, either through human or formula error. For instance, in a column containing people’s ages, an age above 100 or below 0 would suggest errors in the dataset.
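
A sketch of that age check in pandas, using toy data:

    import pandas as pd

    # Toy data containing two deliberately bad ages.
    people = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [34, 130, -2]})

    # Ages below 0 or above 100 point to entry or formula errors.
    suspect = people[(people["age"] < 0) | (people["age"] > 100)]
    print(suspect)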

Other factors to consider (a combined cleanup sketch follows this list):

  • Leading and trailing characters
  • Spaces
  • Line feeds
  • Carriage returns
  • Unix/Windows file formats
  • Unicode: UTF8 vs ANSI
  • Commas and decimal points – many European countries use a comma as the decimal separator; the UK uses a decimal point
  • Currency symbols
  • CSV files – the UK standard delimiter is a comma, but in Germany it is a semicolon (since the comma is used as the decimal separator)
  • Delimited files – if a field such as an address contains commas, they may be incorrectly treated as delimiters
  • Transposition errors – if two digits are swapped, the difference between the correct and entered figures is always divisible by 9 (e.g. 54 entered as 45 differs by 9), so check a suspect difference by dividing it by 9
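
A combined cleanup sketch in pandas, assuming a hypothetical German-style export; the delimiter, decimal and encoding settings address the file-format issues above:

    import pandas as pd

    # Hypothetical German-style export: semicolon delimiter, comma as the
    # decimal separator, read with an explicit UTF-8 encoding.
    df = pd.read_csv("export_de.csv", sep=";", decimal=",", encoding="utf-8")

    # Strip leading/trailing spaces and stray carriage returns or line
    # feeds from every text column.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()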

Search Google for common text file errors to find out more.


Authored by Sam Ellis.

Sam Ellis is a data analytics expert and company accountant at Interworks.