Tuesday, May 25, 2021

Big Data: An Essay

Introduction

Technology has been with us for a very long time now; from the invention of computers onward, we have used it in almost every field. It started with simple tasks like performing mathematical calculations, documenting information easily, and serving as a means of communication. For a very long time we depended on deterministic approaches to fulfil our requirements, which involved programming the hardware either directly or through software programs. Somewhere along this journey we came across problems that were practically impossible to solve with a deterministic approach, either because a solution did not exist or because we did not understand the steps involved in the solution. It has been a constant challenge for computer scientists and developers to come up with proper algorithms for such problems.

Because of these issues, we arrived at the idea of mapping the data to the solution. In this technique we try to develop algorithms that look at the data and produce the output by figuring out the logic in the relationship between the input (the data) and the output. This approach was very different from the previous methods, where humans first had to understand the data and then formulate an algorithm that takes input from the data and produces an output. This new technology worked as a revolution in the field of computer science, since it offered a solution for almost every non-deterministic problem we had come across for centuries. As a result, we now have assistants like Alexa, Google Assistant and Cortana. Apart from these assistants, we also have means to cluster customers, to recommend specific products to them, to do very complex classification, and so on.

Because of this revolution, we realised the importance of data, and this is the point where we started gathering, or acquiring, data from every field: medical, agricultural, engineering, daily life and so on. As of 2020 the size of the data acquisition system market was about 1.8 billion US dollars, and it is expected to grow to about 2.4 billion US dollars by 2025. The demand for and supply of such large volumes of data also require efficient and fast data storage techniques [6]. The most important part of a data acquisition system is the type of data and its storage. A very basic requirement for today's computers is speed, and one of the major components of making a computer system fast is making data retrieval fast; as the amount of data we gather increases day by day, we are realising that a very efficient and very fast data storage system is necessary. The definition of storage suggests that data storage is a collective approach used to capture and hold information using optical, electromagnetic or silicon-based storage devices; storing large amounts of data is not the issue, the issue is making access to the stored data fast. The options commonly considered for data storage nowadays are magnetic tape, magnetic disk, optical disc, flash memory and so on. The most common storage devices are the hard disk drive (HDD), the solid state drive (SSD) and optical storage. The next important step for any application involving data is the retrieval and usage of the data, which involves data extraction, data preprocessing, data analysis and, finally, the application.
Most of the time, the data stored on storage devices is raw data that has not been processed and, in some cases, contains a lot of noise. One of the most important steps in extracting information from data is therefore noise removal, because data that contains a lot of noise may lead to a false solution, or to a solution that is not of much use. This remains a challenge today, because removing noise is quite hard in some cases. Once the data is processed, it is analysed to find the patterns and important information present in it. There are multiple ways to analyse the data. If the analysis is done solely by computers, the analysis steps are based on statistical values like variance, standard deviation and many more; but if the analysis is performed by human beings, the major part of the analysis consists of data visualisation. So the two major ways to analyse any type of data are the mathematical approach and visualisation. Data analysis is not as simple as it sounds, because the approach used for different datasets can vary extensively; not every dataset is the same, so the hardest part of data analysis is deciding which techniques are going to be used.

Rapid growth of data acquisition over the last few decades

The beginning of the data age

Ever since 1936, when Alan Turing invented the Turing machine, the idea of a smart computer has fascinated computer scientists. For a time, artificial intelligence only meant developing algorithms which could act like a human being: people developed purely deterministic algorithms which could, to some extent, imitate human beings in specific tasks. Nobody ever thought that one day machines would learn by themselves, until 1959, when Arthur Samuel coined the term "machine learning". The idea of having machines which can learn by themselves was so fascinating that everyone working in the field of computer science tried to come up with an algorithm that could learn by itself, and along the way they managed to develop algorithms for classification, clustering and so on. By the time these algorithms were available, people realised that they need data to train them. Algorithms like neural networks are very powerful, but they require extensive amounts of data to get better, and this is where the need for data became clearer. People started storing data and using it, and as computers became faster, the need for more data and faster data storage devices became essential.

Need for Data Acquisition

We all know that the best way to boost the revenue of any company is to reduce costs and increase performance. Today almost every company in the world, no matter what the domain, is trying to increase its revenue, and to do so they are moving towards automation. Automation not only makes the work faster but also reduces the possibility of the errors that human employees commonly make. The growth of the automation industry is so big that there are companies like UiPath, Automation Anywhere and Blue Prism which are based solely on software automation. There is, however, a problem in the automation of various processes, whether hardware or software: the scope of deterministic methods for automating a process is limited. A programmer cannot anticipate all the possible ways a process may operate, so automation needed something better than deterministic algorithms. The introduction of artificial intelligence and machine learning into industrial process automation came as a revolution. Tasks like OCR extraction, text classification, image classification, pattern recognition and many more became easily doable using machine learning and artificial intelligence, but these algorithms again require data, and that is why the demand for data storage and data acquisition is increasing so rapidly in the market.

The Future of Data Acquisition

Technology is now involved with everything, from a wrist watch all the way to a rocket, and because of that we are able to gather data from everywhere. The development of autonomous vehicles is one example that represents the potential of machine learning and data warehousing. With such power we can do almost anything, and that is why data warehousing is getting more important every day, leading the demand for data acquisition to increase. This new requirement of the industries created a completely new industry known as data warehousing. The world has been generating about 2.5 quintillion bytes of data every day; by the beginning of 2020 we had created about 44 zettabytes of data, which is about 35 to 40 times more than the number of stars in the observable universe. As for the size of this market, as of 2020 the data acquisition system market was worth about 1.8 billion US dollars, and it is expected to grow to about 2.4 billion US dollars by 2025.

The contributions of the major regions of the world to the growth of the data acquisition market from 2019 to 2024 are shown below.




Types of data and storage

A Brief History of Storage

It all started in 1725, when punch cards were used as a means to communicate; initially they were used as controllers for looms. Almost a century later, in 1837, the first Analytical Engine was designed, and punch cards were used as the means to input and output information for that machine. The sequence of holes in the punch cards represented the instructions, so we can say that the very first data storage device was nothing but a punch card with a bunch of holes in some sequence. Punch cards kept being used as storage devices until magnetic storage took their place around 1960; in time, magnetic tapes completely replaced punch cards and became the primary means of storing data, providing input and taking output from the computers of that era.

In 1947, just after India was declared an independent country, 1024 bits of data were stored for the very first time in a system of cathode-ray tubes. Although the approach was primarily a way to introduce random access memory to computers, we can consider it the basis of modern storage devices. By the end of the 1940s magnetic-core memory had been developed [4], and by the end of 1953 the Massachusetts Institute of Technology had acquired the patents for magnetic-core memory and the first computer using this technology was built. Intel developed the first semiconductor memory chip in 1966, which was able to store 2000 bits; this technology reduced the size of storage devices significantly [5].

The first widely adopted version of the floppy disk was developed in 1976 as an updated version of IBM's 8-inch floppy disk. Its size was 5.25 inches and it was able to store about 110 KB of data. Because of its small size and ease of use, the floppy disk became extremely popular. In the same time span, storage devices based on different technologies were being developed; an updated optical disc was first created in 1980, which led to the creation of CDs, DVDs and Blu-ray. By 1990 a hybrid of magnetic and optical disc, named the magneto-optical disk, had been produced, and the size was reduced further to 3.5 inches. The floppy disc kept being used for a long time afterwards. The first portable USB storage device was introduced in late 2000. It worked as a revolution, because now people were able to take data from one computer to another without carrying a bulky floppy disk. It might seem a bit strange and surprising, but solid state drives have been present all the way from the 1950s; different versions and variations of the SSD were being used back then.

Storing Large Amounts of Data

The idea of storing large amounts of data in the same place first came to life with the implementation of data silos. The primary use of data silos was to store data for organisations and businesses. That was the beginning of big data, and data silos remained the primary way to store big data until the next, better technology came to life. The data lake was the next technology to deal with big data. As its name suggests, it was used by multiple organisations to put a large amount of data into a single place; it was literally a lake of data, just like a lake of water. The basic idea behind data lakes was to store big data as well as process it in the same system, and the architecture used to store and process the data in data lakes was a NoSQL database.

(Figure: the basic architecture of big data.)

The next big thing after the data lake was cloud data storage. Since the amount of data we handle is very large, the architecture is supposed to be capable of the ingestion, processing and analysis of data. Data lake technology was a step towards cloud storage, since it provided access to most of the information via the internet. Cloud storage was basically an extension of data lakes, and it became possible because of improvements in internet connectivity, bandwidth and computation power. The development of cloud storage was an effort to make data storage more economical for organisations and individual users, and after multiple updates over the years, cloud data storage can now provide practically unlimited scalability to businesses and has made data accessible from all over the world without any device barrier.


Extracting Knowledge from Data: Limitations and Dealing with Noise

In data warehouses the data gets stored in its raw format; it is stored this way to prevent any kind of data loss. Because of this, when users need to extract information from the data, they may need to apply several more steps to do so [1][2]. Raw data cannot be used directly in any sort of application; it needs to be processed in a way that allows the necessary information to be extracted. The major steps involved in information extraction are mentioned below:

  1.  Selection

  2.  Preprocessing

  3.  Transformation

  4.  Data mining

Selection

The related data is extracted from the warehouse according to the specified conditions. Since the data stored in the warehouse is raw data, we cannot use the selected data directly in an application. Even though the necessary conditions and constraints are applied while extracting the data from the warehouse, the extracted data still contains some noise, that is, unnecessary information.
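As a minimal sketch of what the selection step can look like in practice (the file name, column names and conditions below are purely hypothetical, not taken from any particular warehouse), selection with pandas might be written like this:

```python
import pandas as pd

# Hypothetical raw export from the warehouse; file and column names are assumptions.
raw = pd.read_csv("warehouse_export.csv")

# Selection: keep only the rows that satisfy the specified conditions.
selected = raw[(raw["country"] == "IN") & (raw["year"] >= 2019)]

# The selected frame is still raw data and typically still contains noise,
# so it is passed on to the preprocessing step rather than used directly.
print(selected.head())
```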

Preprocessing

After selection, the preprocessing step is applied to the data; the sole purpose of preprocessing is to remove the unnecessary data associated with the selection and keep only the necessary part. For example, preprocessing of textual data means the removal of stop words, the removal of special characters, and so on. For image data, the data stored in the warehouse may be in CSV or Excel format and we might need to convert it back into image format. Preprocessing ensures that the data is organised into a proper format for the further steps and that the best performance can be achieved.
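A small illustrative sketch of textual preprocessing, assuming a tiny hand-made stop-word list; it only demonstrates the stop-word and special-character removal mentioned above:

```python
import re

# A tiny, assumed stop-word list purely for illustration.
STOP_WORDS = {"the", "is", "a", "of", "and", "to"}

def preprocess(text: str) -> list[str]:
    # Remove special characters, keeping only letters, digits and spaces.
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", text.lower())
    # Tokenise and drop stop words.
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

print(preprocess("The storage of data, and the retrieval of it!"))
# ['storage', 'data', 'retrieval', 'it']
```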

Transformation 

Transformation is applied to ensure that the data contains only the essential and useful features for the further operations. Data transformation may involve steps like normalisation, generalisation and aggregation [3]. For example, data points like “-5, 37, 100, 89, 78” can be transformed to “-0.05, 0.37, 1.00, 0.89, 0.78”. If the transformation is done correctly, it helps ensure the successful execution of the data mining algorithm.
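The example above corresponds to scaling every value by the largest absolute value in the set (100 here); a minimal sketch that reproduces it:

```python
# Max-absolute scaling, reproducing the example from the text:
# "-5, 37, 100, 89, 78" becomes "-0.05, 0.37, 1.00, 0.89, 0.78".
points = [-5, 37, 100, 89, 78]

max_abs = max(abs(p) for p in points)
scaled = [p / max_abs for p in points]

print(scaled)  # [-0.05, 0.37, 1.0, 0.89, 0.78]
```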

Data Mining

This is the most important part of the whole information extraction process. It consists of numerous complex and smart algorithms used to extract useful patterns from the data. The data mining step may consist of tasks like prediction, classification, clustering, time series analysis and so on. Data mining is not limited to any specific algorithm; it may include algorithms like linear regression, logistic regression, k-means clustering and neural networks.
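As one concrete instance of a mining algorithm, here is a small k-means clustering sketch on invented toy data (scikit-learn is assumed to be available; the points and the number of clusters are arbitrary):

```python
from sklearn.cluster import KMeans

# Toy two-dimensional data: two obvious groups, purely for illustration.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for every point
print(model.cluster_centers_)  # centre of each discovered cluster
```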

In information extraction a very important hurdle is noise. Noise can cause serious problems in the final output of any information extraction process. Noise in data mainly affects the time complexity of the algorithm, the classification accuracy of the final result and, in some cases, the size of the classifier and how reasonable it is. Initially the dataset is not well known, so a random assumption about the type of the noise cannot be made. To deal with this problem, noise simulation is used: we initially assume that the data is free of any type of noise and then simulate noise in the data to find out its overall effect (a small simulation sketch follows the noise types below).

The two main types of noise are:

  1. Class Noise

  2. Attribute Noise

Class noise occurs when data points are incorrectly labelled. Incorrect labelling can occur because of entry errors, inadequate information, or subjective problems during labelling. The main distinguishable forms of class noise are listed below.

  • Contradictory examples

  • Misclassifications

As is clear from the name, attribute noise is caused by disturbances in the attributes of the dataset. The possible manifestations of noisy attribute values are:

  • Attribute filled with errors

  • Missing values

  • Non compatible or incomplete values

Various studies from across the world conclude that attribute noise is more harmful than class noise. Regardless of which type is deadlier, it is always best practice to either remove or correct the noisy data points.
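A minimal sketch of such a noise simulation on an invented toy dataset; the noise rates, distributions and helper names below are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clean dataset: 100 samples, 3 attributes, binary labels.
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

def add_class_noise(y, rate):
    """Simulate class noise by flipping a fraction of the labels."""
    noisy = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

def add_attribute_noise(X, rate, scale=1.0):
    """Simulate attribute noise by perturbing a fraction of the attribute values."""
    noisy = X.copy()
    mask = rng.random(X.shape) < rate
    noisy[mask] += rng.normal(scale=scale, size=mask.sum())
    return noisy

y_noisy = add_class_noise(y, rate=0.10)       # 10% mislabelled points
X_noisy = add_attribute_noise(X, rate=0.10)   # 10% perturbed attribute values
```

Comparing the output of a mining algorithm on the clean and the noisy versions then shows how sensitive the overall result is to each type of noise.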

Different Techniques used to Analyse Data

Analysis of data is required to find specific patterns and relationships between the attributes of the data. The two major approaches to data analysis are the numerical approach and visualisation techniques. The numerical approach is used when the analysis is carried out entirely by computer programs with no specific human intervention; on the other hand, visualisation techniques are used when the analysis is being done by human beings, because humans are better at recognising patterns in visual data than in numerical data.

The different ways of data analysis using numerical approach are mentioned below:

  1.  Regression Analysis

  2.  Multiple Equation Models

  3.  Grouping Methods

Regression Analysis

The idea of regression analysis is based on the fact that, in any dataset, some of the attributes are dependent on others while some attributes are independent of any other attribute. Any variation in the values of the independent attributes affects the values of the dependent attributes. The main ways of carrying out regression analysis are listed below, followed by a small worked sketch.

  • OLS regression

  • Logistic regression 

  • Hierarchical Regression Modelling
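As a small worked sketch of OLS regression on invented toy data (one independent and one dependent attribute), using plain NumPy:

```python
import numpy as np

# Toy data: one independent attribute x, one dependent attribute y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# OLS fits y ≈ a * x + b by minimising the squared residuals.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"slope={a:.2f}, intercept={b:.2f}")  # roughly slope=2, intercept=0
```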

Multiple Equation Models

Multiple equation models are a variation of regression analysis with a few extensions; they are used to find the existing pathways from the independent attributes to the dependent attributes. It is a very important approach for finding existing patterns in the data. Multiple equation models allow us to find both the direct and the indirect effects on the dependent variable. The major classifications of the multiple equation models approach are listed below, followed by a small sketch.

  • SE Modelling (Structural equation)

  • Path analysis
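A minimal path-analysis sketch using two chained OLS regressions on invented data (the variable names, coefficients and noise levels are assumptions for illustration only); it separates the effect of x on y into a direct path and an indirect path through a mediator m:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data for a simple path model: x -> m -> y, plus a direct x -> y path.
x = rng.normal(size=200)
m = 0.8 * x + rng.normal(scale=0.5, size=200)   # mediator
y = 0.5 * m + 0.3 * x + rng.normal(scale=0.5, size=200)

def ols(X, target):
    """Ordinary least squares coefficients (intercept dropped from the output)."""
    X = np.column_stack([X, np.ones(len(target))])
    return np.linalg.lstsq(X, target, rcond=None)[0][:-1]

a = ols(x.reshape(-1, 1), m)[0]          # path x -> m
b, c = ols(np.column_stack([m, x]), y)   # paths m -> y and x -> y

print(f"indirect effect ~ {a * b:.2f}, direct effect ~ {c:.2f}")
```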


Summary

Storage has been a very important part of the computer for a very long time now. It all started in 1725 with the first use of punch cards, and along the way storage devices kept evolving to take smaller sizes and to store larger amounts of data. Companies like IBM, Intel and Hitachi have played a very important role in the evolution of storage devices, and to this day they are among the major providers of cloud storage and cloud computing. With the introduction of the term big data, everything changed: the requirement was no longer only to store the data but also to make data storage, retrieval and processing more efficient. From the day humankind realised the importance of data and its potential, we have been storing data. The data acquisition market is growing very rapidly, and so are the technologies used in the process. As noted earlier, the storage of data alone is not enough; performance on the data should also be good, and even more important than that is the extraction of knowledge from the stored data. To avoid any loss of information, the data stored in warehouses is kept in its raw format so that the information can be extracted from it when needed. A whole new industry now revolves around data warehousing and information extraction. Developing countries like India participate heavily in this revolution and are among the major creators of data. We have seen the potential and we keep moving forward to make sure that technologies like machine learning and artificial intelligence can play a major and important role in our lives, and as a result the size of the data acquisition market will keep increasing every day. Not only until we have perfect artificial intelligence devices, but also to keep them updated, we will need data. As these technologies evolve, we may see improvements in big data and cloud storage as well.


References

[1] Inmon, William H. "The data warehouse and data mining." Communications of the ACM 39.11 (1996): 49-51.

[2] Inmon, William H. "What is a data warehouse?." Prism Tech Topic 1.1 (1995): 1-5.

[3] Zhou, Li, et al. "Recent advances of flexible data storage devices based on organic nanoscaled materials." Small 14.10 (2018): 1703126.

[4] Goda, Kazuo, and Masaru Kitsuregawa. "The history of storage systems." Proceedings of the IEEE 100.Special Centennial Issue (2012): 1433-1440.

[5] Mottier, Véronique. "The interpretive turn: History, memory, and storage in qualitative research." Forum Qualitative Sozialforschung/Forum: Qualitative Social Research. Vol. 6. No. 2. 2005.

[6] Holy, Tomas, et al. "Data acquisition and processing software package for Medipix2." Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 563.1 (2006): 254-258.


Ankush Pandey

Author & Editor

Ankush Pandey is a versatile, hard-working and talented engineer and researcher. He has published several research papers at international and national conferences. His fields of interest and work are image processing, data science and IoT.
