Companies generate huge amounts of data every day, so they need tools to process and analyze it, to convert it into useful and profitable knowledge, and to support decision making.
The term Big Data refers to data sets so large and complex that traditional tools, such as relational databases, cannot process them in an acceptable time or at a reasonable cost. The difficulties arise in data extraction, search, movement, storage, processing and analysis.
Thus, in recent years the concept of Big Data has emerged to describe voluminous data sets that exceed the handling capacity of traditional tools (usually in the range of terabytes, petabytes and beyond). Data volume, however, is not the only property important to its definition.
Data sources are very numerous. Today, in addition to data from traditional sources such as information systems, corporate (transactional) databases and archives, which typically handle structured data with defined formats, organizations are fed with large volumes of unstructured and semi-structured data in many different formats. These large volumes of data are known as Big Data and cannot be processed using traditional relational database tools or, if they could, the processing times would be enormous. Consequently, techniques and tools different from the traditional ones are needed to process them efficiently and reliably. Data currently comes from sources such as the following:
- Information systems (ERP, CRM, SCM, GIS)
- Legacy data (from old databases)
- Relational databases and archives
- E-mails
- Text messages
- XML files
- Web portals
- Social media (blogs, social networks, wikis)
- Private networks
- Multimedia (images, sound, video)
- Streaming data (continuous data flow, text, video, audio)
- Machine data (M2M, machine to machine)
- Sensors
- Biometric data
- Human-generated data
Structured data, or table-like data, comes from traditional databases and archives. The rest is known as unstructured or semi-structured data and is very difficult to handle with traditional tools. For these reasons the new trend called Big Data has appeared.
Definition of Big Data
The most widely cited characterization of Big Data is attributed to Doug Laney, who in 2001, as an analyst at META Group (a firm later acquired by Gartner), described data sets whose volume (usually terabytes or petabytes), velocity and variety exceed the capacity of traditional tools to manipulate and process the information. Laney was referring not only to the volume of data, but also to the speed of data generation and the wide variety of formats. This model is known as the 3V model of Big Data:
- Volume: Overall size of the data set, terabytes and petabytes, although many companies already generate exabytes of information.
- Velocity: Speed at which the data is generated, as well as the speed at which it needs to be processed: in real time or near real time.
- Variety: Wide range of data that can be contained in data sets that come from very diverse sources: web pages, text, audio, video, photographs, sensors, machine data, mobile device data, and so on. Data are classified into three types: structured (data from relational and legacy databases, in table format), unstructured (audio, free text, photographs) and semi-structured (tagged text files such as XML files).
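To give a feel for the magnitudes behind volume and velocity, the following minimal Python sketch converts a steady event rate into a daily data volume. The function names and the example figures (50,000 events per second, 2,000 bytes per event) are illustrative assumptions, not measurements from any real system, and decimal (SI) units are assumed.

```python
# Decimal (SI) byte units, each 1000 times the previous one.
# Binary units (KiB, MiB, ...) would use powers of 1024 instead.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def human_size(num_bytes: int) -> str:
    """Format a byte count with the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


def daily_volume(events_per_second: int, bytes_per_event: int) -> int:
    """Bytes accumulated in one day (86,400 seconds) at a constant event rate."""
    return events_per_second * bytes_per_event * 86_400


# Hypothetical workload: 50,000 events per second, 2,000 bytes each.
print(human_size(daily_volume(50_000, 2_000)))  # prints "8.6 TB"
```

Under these assumed figures, a single source already produces several terabytes per day, which illustrates why volume and velocity are considered together.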
Bernard Marr, one of the best-known authorities on Big Data, acknowledges that all kinds of data are now tracked and stored and that large volumes of data are accessed. However, Marr states that the real value of Big Data lies not in the large volumes of data and their three fundamental properties, but in what we can do with them. It is not the amount of information that makes the difference; it is our ability to analyze large, complex data sets beyond anything we could have done before, and to turn huge amounts of complex data into value.
Types of Data in Big Data
Structured data: Traditional data stored in rows and columns (tables); it is the type most commonly used in ordinary organizational files and databases.
Semi-structured data: This data does not conform to a fixed, explicit schema and is not limited to specific fields, but it carries markers that separate its elements. Its structure is irregular, so it cannot be managed in a standard way; it typically uses hypertext or extensible markup languages. Examples of such data are XML documents, HTML, sensor data, and so on.
Unstructured data: This is the most complex data; it is presented in formats that cannot be easily manipulated by relational databases: Word files, PDFs, PPT, spreadsheets, multimedia documents, audio, voice, video, photographs, e-mails.
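The practical difference between the three types can be sketched in a few lines of Python using only the standard library. The sample records, field names and helper functions below are invented for illustration; real Big Data workloads would rely on distributed tools rather than in-memory strings.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data, one small example per data type.
STRUCTURED_CSV = "customer_id,country,amount\n1,ES,120.50\n2,MX,80.00\n"
SEMI_STRUCTURED_XML = (
    "<orders>"
    "<order id='1'><amount>120.50</amount></order>"
    "<order id='2'><amount>80.00</amount><note>gift</note></order>"
    "</orders>"
)
UNSTRUCTURED_TEXT = "Great service, but delivery was late. Delivery took two weeks."


def total_amount(csv_text: str) -> float:
    """Structured: every row has the same columns, so a field can be read by name."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row["amount"]) for row in rows)


def order_notes(xml_text: str) -> dict:
    """Semi-structured: tags mark the elements, but a field may be absent (None)."""
    root = ET.fromstring(xml_text)
    return {order.get("id"): order.findtext("note") for order in root.findall("order")}


def word_counts(text: str) -> Counter:
    """Unstructured: no schema at all, so structure must be derived from the content."""
    return Counter(word.strip(".,").lower() for word in text.split())


print(total_amount(STRUCTURED_CSV))
print(order_notes(SEMI_STRUCTURED_XML))
print(word_counts(UNSTRUCTURED_TEXT).most_common(1))
```

The structured case can be queried directly by column name, the semi-structured case has to tolerate missing elements, and the unstructured case needs an extra step (here, a naive word count) before any analysis is possible.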