In today’s post, we discuss the Big Data phenomenon from the viewpoint of software engineers and their managers, who face a choice: develop competencies in Big Data and related state-of-the-art technologies such as machine learning, and thus offer the company’s services in this field; or take their time and keep improving the skills they already excel at and that sell well. Whatever the answer, let’s try to figure out what lies behind the term “Big Data”. Some readers will object that the author is late and that Big Data is no more: the term disappeared from the 2015 Gartner Hype Cycle chart, splitting into several narrower terms, and a number of publications exclaimed “Big Data is out!”, dismissing it as an obsolete marketing phrase.
However, if we turn to the history of this domain, we can see that the pioneers of Big Data treated it as a complex phenomenon with at least three aspects. The first is technological: a combination of technologies for operating on data along three dimensions (the 3Vs): high volume, high velocity, and high variety. By this definition, a video content search system qualifies as a Big Data system only when, besides searching tens or hundreds of terabytes of stored video, it must find a video within a limited time frame and draw on all available data sources about films, not just the video files themselves (for example, their text reviews). The second aspect of the Big Data phenomenon concerns problems that, before this approach appeared, had never even been stated: searching for extremely rare events that emerge only from a joint analysis of billions of events, or hunting for correlations across billions of data flows. Finally, the third aspect lies in the mythology that has grown around the approach. Big Data evangelists proclaimed new opportunities for solving the most fantastic problems, up to full control over all of mankind, while sceptics and opponents attacked Big Data for violating privacy and criticized the field’s problems and technologies. The dispute over the end of the Big Data era probably also belongs to this mythological aspect. As for the Gartner Hype Cycle for emerging technologies, the chart merely refined the technological aspects of Big Data as it transformed from an “emerging” into a “mature” technology. This view is confirmed by marketing reports and research that still treat Big Data as an independent market niche and a product with identifiable features.
It should also be noted that Big Data is not just another instance of a direction that emerged earlier in information technology, so-called high-performance computing (HPC). HPC continues to develop successfully, but it addresses a different set of tasks, based on complex mathematical models, and uses a different class of machines: supercomputers, supercomputer clusters, and grid computing. Big Data technologies, as a rule, run in a different environment, commodity computers, and scale simply by adding more of them.
Turning to the new tasks that Big Data introduces, it is worth mentioning how closely they interweave with the problems grouped under the term IoT (Internet of Things): a system of interrelated sensors and actuators of various purposes, integrated via the Internet with powerful computational processing systems. In particular, it has become possible to build so-called virtual power plants (VPPs). They collect data on current energy consumption from all smart electricity meters in a region, along with data on the power generated by all solar panels, wind turbines, and other sources. They then manage the power network configuration in software to balance loads against sources and maximize the production surplus of the distributed generation system.
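As a toy illustration of the balancing step described above, here is a minimal sketch; the readings and function name are hypothetical and do not correspond to any real VPP API:

```python
# Minimal sketch of one virtual-power-plant balancing interval.
# All readings are hypothetical kW values reported by smart meters
# and distributed generators for a single time interval.

def vpp_balance(consumption_kw, generation_kw):
    """Return (total load, total generation, surplus) for one interval."""
    load = sum(consumption_kw)
    supply = sum(generation_kw)
    return load, supply, supply - load

meters = [1.2, 0.8, 2.5, 3.1]   # household consumption readings
sources = [2.0, 1.5, 4.0]       # solar, wind, and other generation
load, supply, surplus = vpp_balance(meters, sources)
print(f"load={load:.1f} kW, supply={supply:.1f} kW, surplus={surplus:.1f} kW")
```

A real VPP would, of course, ingest these readings as continuous streams and feed the surplus into a dispatch optimizer, but the core aggregation is exactly this.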
Let’s now turn to the technological aspect of Big Data. The 3Vs model creates a need for highly scalable systems that both store data and process it effectively. Big Data commonly relies on scalable technologies in which storage costs grow linearly with data size. This is why almost all developers build file storage on the Hadoop Distributed File System (HDFS). HDFS is part of Apache Hadoop, the general-purpose Big Data framework, and it can be installed on a single computer as well as on a cluster of computers connected by an IP network. Alongside the file system, YARN is used: a “data operating system” that manages resources across the whole cluster. The key feature of HDFS is that it is not a low-level file system operating at the device and kernel level, like NTFS, but a high-level abstraction built on the standard file system and functions of the local OS and virtual machines. One Big Data technology aimed at increasing processing performance is in-memory processing, which keeps all processed data in RAM and eliminates swapping during processing. Recall that IBM Watson, the well-known natural language question-answering system that won the TV show Jeopardy!, kept all the game information in 16 TB of RAM and made no Internet requests.
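To make the processing side concrete, the MapReduce model that Hadoop popularized for HDFS-resident data can be sketched in a single process. This is an illustration of the map/shuffle/reduce idea only, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word, as a Hadoop
    # streaming mapper would.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here, a sum).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big velocity", "big variety"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'velocity': 1, 'variety': 1}
```

In a real cluster, the map and reduce phases run in parallel on many commodity machines and the shuffle moves data over the network; the linear scaling mentioned above comes from adding machines to each phase.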
Now let’s talk about the specific features of Big Data processing. The uncertainty of data generation models in Big Data tasks, together with the need for low latency, has led developers to make wide use of meta-algorithmic approaches. These are based not on processing algorithms written for a specific data set described by a generative model, a format, or data relations, but on metamodels capable of configuring themselves into different algorithms depending on the data being processed. Such approaches are called data-driven systems. Among them, machine learning (ML) is the most advanced and widely used today. Applying ML in production software has specific features that affect every development phase and even the relationship with the customer; a detailed description of these aspects is a topic for a separate article, so for now let’s discuss the types of technologies that make up data-driven systems. Historically, the first approach to machine learning was so-called supervised ML. A task is suited to supervised ML if we need to draw a conclusion from incoming data and we have a large data set, generated by a slightly varying reality, for part of which we know precisely what conclusion should be drawn. We take that part, the labeled training data, and present it to the ML meta-algorithm. The machine is trained and is then capable of drawing conclusions on its own from new, previously unseen data. Other meta-algorithms often used in Big Data are known as unsupervised ML methods, or self-learning systems. Here a data set is presented to the system for self-learning, and conclusions that solve the initial problem are drawn from the structures the system discovers in the data. Various data clustering methods belong to this family: they make it possible, for example, to segment clients by their loyalty to the company or to reveal anomalous behavior of objects.
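Both schemes can be sketched in a few lines of NumPy. The nearest-centroid classifier and the single k-means step below are illustrative choices of meta-algorithm, and the data is made up; they only show where labels enter the picture (and where they don’t):

```python
import numpy as np

# --- Supervised: a nearest-centroid classifier (labels are known) ---
def fit_centroids(X, y):
    """Training: compute one centroid per known class label."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x):
    """Inference: assign a new point to the closest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])          # the labeled training data
centroids = fit_centroids(X, y)
print(predict(centroids, np.array([0.1, 0.2])))  # 0
print(predict(centroids, np.array([4.8, 5.0])))  # 1

# --- Unsupervised: one k-means refinement step (no labels given) ---
def kmeans_step(X, centers):
    """Assign points to the nearest center, then move centers to the mean."""
    labels = np.array([np.argmin([np.linalg.norm(x - c) for c in centers])
                       for x in X])
    new_centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, new_centers

labels, centers = kmeans_step(X, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(labels)  # [0 0 1 1] -- clusters discovered from the data alone
```

The supervised path needs `y`; the unsupervised path recovers the same grouping purely from the structure of `X`, which is exactly how client segments or anomalies are found without labels.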
Finally, the third type of ML is called reinforcement learning. This group covers algorithms that are trained in the course of interacting with an external environment and do not require a separate data set for training. Learning consists in developing actions on the environment that maximize the environment’s assessed response to those actions; an example is a self-learning management system for a self-driving car that learns to drive in varying traffic flows. Reinforcement learning has come to play an important role among ML methods. Recently, one such system, AlphaGo, beat the world champion at the game of Go, which few experts expected given the game’s enormous semantic complexity. Elon Musk, the business magnate, engineer, and inventor, together with other backers pledged one billion dollars to establish the non-profit company OpenAI (Open Artificial Intelligence), and a toolkit for developing and testing reinforcement learning algorithms became the company’s first product.
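A minimal sketch of the idea, assuming a toy “corridor” environment rather than a driving task: a Q-learning agent interacts with the environment and, from rewards alone, learns that moving right pays off. All names and hyperparameters here are illustrative:

```python
import random
import numpy as np

# Q-learning on a tiny corridor: states 0..4, actions 0=left, 1=right.
# The agent is rewarded only on reaching state 4; it must discover,
# purely from interaction, that moving right maximizes return.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

def step(state, action):
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

random.seed(0)
Q = np.zeros((N_STATES, 2))
for _ in range(200):                    # training episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit, sometimes explore.
        a = random.randrange(2) if random.random() < EPSILON else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Bellman update: move Q[s,a] toward r + gamma * max_a' Q[s',a']
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s2]) - Q[s, a])
        s = s2

policy = [int(np.argmax(Q[s])) for s in range(GOAL)]
print(policy)  # the learned greedy policy for the non-goal states
```

Note that no training set was ever supplied: the table `Q` is built entirely from the stream of (state, action, reward) interactions, which is the defining trait of this third ML family.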
Looking deeper into machine learning, we find more terms that can be regarded as specific marketing features, namely deep learning and tensor decomposition. Later on I will probably devote separate posts to the underlying mathematics, but for now let me convey the general idea. Machine learning has been developing for decades, riding waves of success and failure as it attacked ever more complex problems, and progress grew out of the contest between two main directions. One relied on a mathematically justified choice of features extracted from the data and on their relation to the semantics of the task at hand. The other handed raw attributes to the machine and gave the ML meta-algorithm maximum independence, focusing only on the final result. Tensor decomposition crowns the first direction. Its lineage runs through latent semantic analysis (LSA) for natural language texts, collaborative filtering in recommendation systems, principal component analysis, and non-negative matrix factorization, in which a large matrix is factorized into a product of two substantially smaller ones. A tensor, as a multidimensional generalization of a matrix, can likewise be represented as a product of simpler components, and the task then becomes easier to understand in the language of these simpler structures.
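Non-negative matrix factorization, the last step in that lineage, can be sketched with the classic multiplicative update rules; the matrix sizes, rank, and iteration count below are arbitrary illustrative choices:

```python
import numpy as np

# Non-negative matrix factorization by multiplicative updates:
# approximate V (6 x 5) as W (6 x 2) @ H (2 x 5), with all factors
# non-negative. The small inner dimension k=2 is what makes the two
# factors "substantially smaller" than V.
rng = np.random.default_rng(0)
V = rng.random((6, 5))            # toy data matrix
k = 2
W = rng.random((6, k)) + 0.1      # positive initial factors
H = rng.random((k, 5)) + 0.1

def frob_error(V, W, H):
    return np.linalg.norm(V - W @ H)

initial = frob_error(V, W, H)
for _ in range(200):
    # Multiplicative updates keep W and H non-negative by construction.
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

final = frob_error(V, W, H)
print(f"reconstruction error: {initial:.3f} -> {final:.3f}")
```

The same idea extends to tensors: a multidimensional array is approximated by products of low-rank components, and the rows of the small factors become the interpretable “simpler structures” mentioned above.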
Deep learning, in turn, is about building a neural network in which many layers, interacting with one another during training, organize a structure of coefficients: attribute vectors with no semantic meaning of their own. Nevertheless, the output layer of such a network delivers results that solve many problems, such as recognizing images, converting phrases into numerical representations, composing music, and turning speech into text and vice versa.
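A minimal sketch of this idea, assuming a toy two-layer network trained on XOR (a classic task no single layer can solve); the architecture and hyperparameters are illustrative, and real deep networks simply stack many more such layers:

```python
import numpy as np

# A tiny fully connected network trained by backpropagation on XOR.
# The hidden layer learns internal features with no assigned semantic
# meaning, yet the output layer solves the task.
rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)      # hidden features: learned, not designed
    return h, sigmoid(h @ W2 + b2)

_, y = forward(X)
initial_loss = float(np.mean((y - t) ** 2))
lr = 1.0
for _ in range(5000):
    h, y = forward(X)
    # Backpropagate the squared error through both layers.
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

_, y = forward(X)
final_loss = float(np.mean((y - t) ** 2))
print(f"MSE: {initial_loss:.3f} -> {final_loss:.3f}")
```

The point of the sketch is the division of labor: nothing in `W1` was hand-designed to mean anything, yet training organizes it into whatever intermediate representation the output layer needs.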
If, having read this article, you still have questions, please let me know; I would be happy to answer them.