Wednesday, March 2, 2016

Big Unstructured Data v/s Structured Relational Data

Data, by definition, is any information that has been translated into a form that is convenient to move and process. It can exist in many forms, such as numbers or text on a piece of paper or bits and bytes stored in computer memory. In this post we try to understand how data is classified based on how it is stored in a computer system.



Structured Data

We can think of structured data as any data that resides in fixed fields within a file or record. Data stored in an Excel sheet with rows and columns, or data stored systematically in a relational database, are both examples of structured data.

As a basic definition, structured data is data stored with a high degree of organization, so that it can be easily accessed, searched or operated upon. Typically this type of data is stored in relational databases with ordered rows and columns and fixed data types such as varchar, integer, boolean etc. To do this, a data model must first be defined based on the business requirements, followed by the data types, data constraints such as referential integrity, and metadata for the relational database. Data is most commonly retrieved and managed using Structured Query Language (SQL), which is also used to perform operations such as insert, update and delete on structured data.
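To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The department and employee tables, their columns and the sample rows are all made up for illustration. The sketch only shows the general idea of a fixed data model, a referential integrity constraint and the basic SQL operations.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Data model (hypothetical): fixed columns, fixed types, one foreign key.
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    VARCHAR(50) NOT NULL
);
CREATE TABLE employee (
    emp_id    INTEGER PRIMARY KEY,
    name      VARCHAR(50) NOT NULL,
    is_active BOOLEAN NOT NULL,
    dept_id   INTEGER NOT NULL REFERENCES department(dept_id)
);
""")

# Insert, update and delete through SQL.
conn.execute("INSERT INTO department VALUES (1, 'Finance')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 1, 1)")
conn.execute("INSERT INTO employee VALUES (11, 'Ben', 1, 1)")
conn.execute("UPDATE employee SET is_active = 0 WHERE emp_id = 11")
conn.execute("DELETE FROM employee WHERE emp_id = 10")

# Retrieve what is left.
rows = conn.execute("SELECT emp_id, name, is_active FROM employee").fetchall()

# Referential integrity: a row pointing at a missing department is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (12, 'Cara', 1, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

After running this, rows holds the single remaining employee, (11, 'Ben', 0), and fk_rejected is True because department 99 does not exist.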

Unstructured Data

This type of data generally includes text and multimedia content such as images, audio, video, web pages etc. It is called unstructured because it typically does not fit into a conventional database. It is often estimated that 80 to 90% of the data in an organization is unstructured, and with the advent of advanced computing the volume of unstructured data keeps growing.

Unstructured data in its original form does not give any meaningful insight. It first needs to be extracted and prepared before it can be processed into meaningful information for an organization.
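As a rough illustration of that extraction step, the sketch below pulls a few fields out of a free-text note with regular expressions. The note, the field names and the patterns are invented for this example. Real extraction pipelines are usually far more involved.

```python
import re

# Hypothetical free-text support note; nothing here comes from a real system.
note = ("Customer Jane Roe called on 2016-02-27 about order #4821. "
        "Refund of $35.50 approved.")

def extract_fields(text):
    """Pull a few structured fields out of free text; missing fields stay None."""
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", text)
    order = re.search(r"order #(\d+)", text)
    amount = re.search(r"\$(\d+(?:\.\d{2})?)", text)
    return {
        "date": date.group(1) if date else None,
        "order_id": int(order.group(1)) if order else None,
        "amount": float(amount.group(1)) if amount else None,
    }

record = extract_fields(note)
```

Here record comes out as a date of "2016-02-27", an order_id of 4821 and an amount of 35.5, which could then be loaded into a table like any other structured row.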

Here's a simple table that shows the difference between structured and unstructured data:



Until recently, organizations were overwhelmed by the large volume of unstructured data. More recently, tools and techniques have emerged to manage and organize this data efficiently. They can be broadly classified as follows:
  • Big data: Tools like Hadoop help in processing and structuring data that is extremely complex and volatile in nature.
  • Business Intelligence: Tools such as IBM Watson help organizations analyze data and visualize information through dashboards.
  • Search & Indexing tools: These tools help in retrieving useful information from unstructured files such as web pages, Word documents etc.
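To give a feel for the search and indexing idea in the list above, here is a toy inverted index in Python. The three documents are invented, and this is only a sketch of the general technique, not how any particular search product works.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for web pages or Word files.
documents = {
    "page1": "Data warehousing stores historic data",
    "page2": "Hadoop processes unstructured data",
    "page3": "Dashboards visualize warehouse metrics",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

index = build_index(documents)
```

With this index, search(index, "data") returns ["page1", "page2"], while search(index, "unstructured data") narrows that down to ["page2"].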
From a data warehousing point of view, some of the common types of data present in any large organization are as follows:
  1. Metadata: Simply put, metadata is data about data. It contains the information required for extraction, transformation and loading of data from various source systems into the data warehouse.
  2. Historic data: Organizations have large volumes of historic data that can be useful in providing insights into various aspects of the company.
  3. Derived Data: Derived data is obtained from existing information through some mathematical operations or data transformation. Such data can be generated at run time and can sometimes also be stored as part of the database schema.
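The two options for derived data mentioned in the list above (computing it at run time or storing it in the schema) can be sketched with sqlite3 as follows. The order_line table, the view and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_line (
    order_id   INTEGER,
    quantity   INTEGER,
    unit_price REAL
);
INSERT INTO order_line VALUES (1, 2, 10.0), (1, 1, 5.0), (2, 4, 2.5);

-- Derived at run time: the view recomputes totals on every query.
CREATE VIEW order_total_v AS
SELECT order_id, SUM(quantity * unit_price) AS total
FROM order_line
GROUP BY order_id;

-- Derived and stored: the result is copied into a table of its own.
CREATE TABLE order_total AS SELECT * FROM order_total_v;
""")

runtime_totals = conn.execute(
    "SELECT order_id, total FROM order_total_v ORDER BY order_id").fetchall()
stored_totals = conn.execute(
    "SELECT order_id, total FROM order_total ORDER BY order_id").fetchall()
```

Both queries return [(1, 25.0), (2, 10.0)]. The difference is that the view always reflects the current order lines, while the stored table has to be refreshed when the source rows change.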

Data Warehousing

Data warehousing allows an organization to bring together large data sets from different departments or areas of the organization in one place. Data is extracted from different OLTP applications and other sources so that it can be used by analytical applications and user queries. This helps to collect and process data that can later be presented to business users.
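As a very small sketch of that flow, the example below takes records from two made-up source applications with different layouts, brings them into one common shape, loads them into a single warehouse table and runs an analytical query on top. The source names, fields and the fact_sales table are assumptions for illustration only.

```python
import sqlite3

# Hypothetical extracts from two OLTP applications with different layouts.
sales_app = [
    {"sku": "A1", "qty": 3, "day": "2016-03-01"},
    {"sku": "B2", "qty": 1, "day": "2016-03-01"},
]
web_shop = [("A1", "01/03/2016", 2), ("C3", "02/03/2016", 5)]  # (sku, dd/mm/yyyy, qty)

def transform():
    """Bring both sources into one common shape: (source, sku, iso_date, qty)."""
    for r in sales_app:
        yield ("sales_app", r["sku"], r["day"], r["qty"])
    for sku, day, qty in web_shop:
        dd, mm, yyyy = day.split("/")
        yield ("web_shop", sku, f"{yyyy}-{mm}-{dd}", qty)

# Load: one warehouse table that holds rows from every source.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_sales (source TEXT, sku TEXT, sale_date TEXT, qty INTEGER)")
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", transform())

# Analytical query across both sources.
units_per_sku = warehouse.execute(
    "SELECT sku, SUM(qty) FROM fact_sales GROUP BY sku ORDER BY sku").fetchall()
```

The result, [('A1', 5), ('B2', 1), ('C3', 5)], combines sales of A1 from both applications, which neither source system could answer on its own.
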
Here's a small video explaining the whats and hows of data warehousing.





Limitations of Data Warehousing

While data warehousing is a very useful way of managing large volumes of data from different sources, it has some limitations:
  • Ownership of data is lost once it enters the data warehousing system from the original data source. Security, privacy and accountability of data can become a concern in this case.
  • The initial implementation of a data warehouse usually takes a long time and comes with high associated costs.
  • It becomes difficult to incorporate changes in data types, data source schemas, indexes and queries once the data warehouse is completely set up.
  • Data warehousing has a high maintenance cost. This is because any change in the business requirements or source data will lead to significant changes in the data warehousing process, which increases the cost.


Future of Data Warehousing 

Data warehousing has been around for decades and is now evolving into an analytic warehouse. More and more vendors are coming up with data warehouses that have advanced statistical capabilities for performing analytics and forecasting. Emerging platforms such as Hadoop provide distributed storage and processing that make it easier to process large volumes of unstructured data. With cloud computing and mobile computing already popular and with the emergence of the Internet of Things, the volume of data is going to increase exponentially. Processing data and performing data analytics in the cloud is predicted to become the norm, which will make warehousing simpler and more convenient. With cloud-based data warehousing, costs are expected to be significantly lower than those of traditional on-premises offerings, including management overhead. We can also expect structured and unstructured data from back-end systems to be brought into the data warehouse in real time and near-real time.
The ability to incorporate big data techniques, analytics technologies, back end systems, and traditional data warehouses will potentially change the economics of data warehousing in the future.


References:

  • http://www.webopedia.com/TERM/U/unstructured_data.html
  • http://www.webopedia.com/TERM/S/structured_data.html
  • deloitte.wsj.com/cio/2013/07/17/the-future-of-data-warehouses-in-the-age-of-big-data/
  • https://www.youtube.com/watch?v=cmQomHNZW4g


