Big Data solutions
In the broadest sense, Big Data refers to the processing and analysis of large volumes of data. You are dealing with “Big Data” if your data cannot be stored in your database and cannot be processed with classical tools, typically on a single server. As technology advances, servers become ever more powerful and this boundary gradually shifts upwards.
However, data challenges are not only due to the amount of data but also to their complexity. The total complexity is often described with the four Vs: volume, velocity, variety and veracity.
- Volume refers to the amount of data itself, which is generally on the terabyte and petabyte scale.
- Velocity refers to the rate at which large volumes of data arrive. Today, attention is increasingly being paid to data processing in real time.
- Variety refers to the diversity of data. Data often arrive in a mix of structured and unstructured forms, which adds to the complexity of data analytics.
- Veracity refers to the level of unreliability and inaccuracy of data. The analytic processes often involve unstructured and “unclean data”, which may result in inaccurate conclusions.
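Volume and velocity are linked by simple arithmetic: the arrival rate multiplied by the retention period gives the amount of data to be stored. The following minimal sketch shows this back-of-the-envelope estimate. The function names, the workload figures and the 100 TB server capacity are illustrative assumptions, not measurements from any real system, and the estimate ignores replication and compression.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # one terabyte in bytes (decimal definition)


def daily_volume_bytes(events_per_second, bytes_per_event):
    """Raw data arriving per day, before replication or compression."""
    return events_per_second * bytes_per_event * SECONDS_PER_DAY


def fits_on_one_server(events_per_second, bytes_per_event,
                       retention_days, server_capacity_bytes):
    """True if the retained data would fit within one server's storage."""
    total = daily_volume_bytes(events_per_second, bytes_per_event) * retention_days
    return total <= server_capacity_bytes


# Hypothetical workload: 50,000 events per second, 1 kB each, kept for one year.
per_day = daily_volume_bytes(50_000, 1_000)
print(per_day / TB)                                      # terabytes per day
print(fits_on_one_server(50_000, 1_000, 365, 100 * TB))  # assumed 100 TB server
```

Under these assumed figures the stream produces a few terabytes per day, so a year of retention exceeds the assumed single-server capacity many times over.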
Big Data in a narrow sense
In a narrow sense, the term Big Data describes a set of tools for the processing, analysis and storage of data. Hadoop has made it possible to store and analyze data on inexpensive servers and has brought Big Data analytics within reach of smaller companies. Hadoop itself consists of a distributed file system (HDFS), the MapReduce framework for programming distributed batch jobs and the YARN system for executing them. Around it there is a large set of other tools and systems which together form the “Hadoop Ecosystem”. What follows is a survey of the more important tools from this set.
- Kafka is a distributed and highly scalable system for sending and receiving messages. Kafka persists every message to disk, so it can also serve as a reliable layer for the asynchronous exchange of messages between systems. The Kafka Streams and Kafka Connect components make it possible to write programs for continuous processing and storage in real time.
- Apache Spark is a distributed data processing system. Spark batch jobs can be considerably faster than their Hadoop MapReduce counterparts (up to 100 times for workloads that fit in memory). In addition, it offers a concise and clear API, which provides functionality for processing structured data with standard SQL (Spark SQL), processing unstructured data (Spark Core), organising data into graphs (Spark GraphX), machine learning algorithms (Spark ML) and data processing in real time (Spark Streaming). Spark has gone mainstream and is included in the major Hadoop distributions. It is used by a large number of companies across the world. The largest known Spark clusters consist of thousands of servers. Should you wish to receive more information on Spark, we recommend the book Spark in Action (www.sparkinaction.com), written by our colleague Petar Zečević.
- HBase is a distributed database which is optimized for fast data lookup and reading. It is modelled after Google’s Bigtable and stores data on the HDFS. It differs from classical relational databases in a number of ways: it groups stored data by column families rather than by complete rows, it does not use classical indices, and it allows different rows to hold different numbers and types of columns. It is one of the most commonly used databases in the Big Data world.
- Hive is a data warehouse for Hadoop. Structured data managed by Hive are stored on the HDFS. Hive can also access external data sources, such as classical relational databases and Parquet or ORC files on the HDFS. HiveQL (a dialect of SQL) is used to access and manage these data. Hive traditionally uses the slow MapReduce engine to execute SQL queries, but work on using Spark as its execution engine is currently under way.
- Cassandra is yet another distributed database, also modelled after Google’s Bigtable, but it is optimized for fast writes. It is often used for storing time-dependent (and other kinds of) data. It allows the consistency level to be adjusted for each query and tolerates node failures.
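The per-query consistency levels mentioned for Cassandra can be illustrated with a small calculation. The sketch below is a simplified model, not the Cassandra driver API: the function names are hypothetical, only the levels ONE, QUORUM and ALL are covered, and a single data center is assumed. It relies on the usual rule that a read is guaranteed to see the latest write when the number of replicas acknowledging the write plus the number consulted by the read exceeds the replication factor.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True if the read and write replica sets must overlap (R + W > RF)."""
    writes = replicas_required(write_level, replication_factor)
    reads = replicas_required(read_level, replication_factor)
    return writes + reads > replication_factor


def tolerated_failures(level, replication_factor):
    """How many replicas may be down while the query can still succeed."""
    return replication_factor - replicas_required(level, replication_factor)


# With three replicas, QUORUM reads and writes overlap and survive one failure.
print(read_sees_latest_write("QUORUM", "QUORUM", 3))
print(tolerated_failures("QUORUM", 3))
```

The trade-off is visible in the numbers: lower levels respond with fewer replicas and tolerate more failures, while higher levels give stronger guarantees at the cost of availability.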
Space does not allow us to mention the many other tools and frameworks, more or less important. You can find more information about them here.
The large number of different tools which are generally used together in Big Data projects may lead to version incompatibilities. To make their use as painless as possible, Hadoop distributions have been created to package mutually compatible versions of the tools into one unit and to provide additional tools for cluster security and management (adding servers, upgrading versions, etc.).
Some of the best-known Hadoop distributions include IBM BigInsights, Hortonworks, Cloudera CDH and MapR. IBM BigInsights, for example, consists of the following components: Ambari (used for cluster management), Knox and Ranger (used for cluster security), Avro, Flume, Hadoop, HBase, Hive, Kafka, Oozie, Parquet, Parquet-mr, Pig, Slider, Solr, Spark, Sqoop, ZooKeeper, Titan, Phoenix, Text Analytics, Big SQL, Big R and BigSheets.
Use cases
So, when should you use Big Data tools?
First of all, if you have more data than can be processed efficiently on one machine, it is recommended to migrate the processing to a distributed system. Similarly, if you expect a large influx of data in the future, it is wise to implement a Big Data system in advance to prepare for it.
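To give an idea of what moving a job to a distributed system involves, here is a minimal single-process sketch of the map, shuffle and reduce pattern used by Hadoop MapReduce, applied to a word count. It is only an illustration in plain Python and does not use the Hadoop API: the function names are hypothetical, and the partitions, which would live on different servers in a real cluster, are simulated as in-memory lists.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of the input."""
    for line in partition:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    """Run the three phases; each partition could be mapped on a different server."""
    pairs = []
    for partition in partitions:
        pairs.extend(map_phase(partition))
    return reduce_phase(shuffle(pairs))


# Two simulated partitions of a larger data set.
partitions = [["big data big"], ["data lake"]]
counts = word_count(partitions)
print(counts)
```

Because each partition is mapped independently, adding servers lets more partitions be processed at the same time, which is what makes the pattern suitable for data that outgrow a single machine.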
A term closely related to Big Data projects is the data lake. It involves the idea of storing all the data of an enterprise in a single place (typically on the HDFS and in Big Data databases) so that all interested parties have easy access to the data and can make better use of them. Enterprises successfully use only a small share of the data available to them. Business can be significantly improved if all available client and enterprise data are analyzed. Having data available in one place also allows for predictive analytics using statistical and machine learning methods.
How can SV Group help you?
SV Group is highly experienced in creating Big Data solutions. We have successfully helped various companies migrate classical processing jobs to Big Data technologies in order to speed them up several times over. We have designed hardware and software Big Data infrastructures to support large-scale batch jobs. We have implemented applications for processing huge amounts of data in real time (streaming applications). We have also developed a solution for social network analysis through graph visualisation, and we have expertise in machine learning and predictive analysis methods.
In short, feel free to turn to us with confidence because we are capable of supporting your Big Data project in all of its phases!
Let us conclude by quoting the consultants of McKinsey, who wrote the following in their report “Big Data: The next frontier for innovation, competition and productivity”:
“The use of Big Data will become a key basis of competition and growth for individual firms, mostly computer and information, finance, insurance, and government sectors”