Big Data: Where to start?

“Do I learn Hadoop, Kafka, or AWS? And what in the AWS stack?” The big data world is enormous. Big data engineer, big data analyst, big data scientist: are these different names for the same role? It is overwhelming to figure out which strand to take hold of, and how to climb that big mountain. On top of that come the questions of which algorithm, which tooling, and which language to use.

Figure: A satellite hovering above the Earth, observing the weather. Weather stations continuously use big data to predict the future.

Let us start by understanding the roles and responsibilities behind these job titles. This can also serve as a reference for the skill set we need if we want to do all of it ourselves. Later, we shall dive deeper into which stack is usually recommended. Spoiler alert: there is no one recipe.

Data Analyst

Data analytics is the process of extracting information from a given pool of data. A data analyst extracts that information through several methodologies such as data cleaning, data conversion, and data modeling. Data analytics is used in many industries, such as technology, medicine, social science, and business. Industries can now make careful data-driven decisions because data analysis lets them analyze market trends, understand their clients' requirements, and review their own performance.
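To make data cleaning and data conversion a little more concrete, here is a minimal sketch using only the Python standard library. The records, the field names (`region`, `revenue`) and the helper functions are made up for illustration; a real analysis would more likely use a dedicated tool, so treat this as one possible shape rather than the way it must be done.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a CSV export.
raw_rows = [
    {"region": " North ", "revenue": "1200.50"},
    {"region": "south", "revenue": "980"},
    {"region": "North", "revenue": ""},  # missing value
    {"region": "South", "revenue": "1020.00"},
]


def clean(rows):
    """Cleaning and conversion: normalize labels, parse numbers, drop incomplete rows."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().title()
        revenue_text = row["revenue"].strip()
        if not region or not revenue_text:
            continue  # skip rows with a missing field
        cleaned.append({"region": region, "revenue": float(revenue_text)})
    return cleaned


def average_revenue_by_region(rows):
    """A very simple summary: mean revenue per region."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["revenue"])
    return {region: mean(values) for region, values in by_region.items()}


summary = average_revenue_by_region(clean(raw_rows))
print(summary)
```

The row with the empty revenue is dropped during cleaning, and the inconsistent spellings of the region names are merged before the averages are computed.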

A Data Analyst is also well versed in several visualization techniques and tools. Presentation skills are essential for a data analyst, because they allow the analyst to communicate results to the team and help it reach proper solutions.

Data Analytics allows industries to run fast queries that produce actionable results within a short period of time. This focuses data analytics on the short-term growth of the business, where quick action is required.

Data Engineer

A Data Engineer is a person who specializes in preparing data for analytical use. They develop the foundation for various data operations and are responsible for designing the format of the data that data scientists and analysts work on.

They need to work with both structured and unstructured data. Data Engineers allow data scientists to carry out their data operations. They have to deal with Big Data where they engage in numerous operations like data cleaning, management, transformation, data deduplication etc.
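As an example of one of these operations, deduplication can be sketched in a few lines of Python. The `deduplicate` helper, the sample events and the choice of key fields are assumptions made for this illustration; at big data scale the same idea would normally run inside a distributed processing framework.

```python
def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        # Normalize the key so that case and stray whitespace do not hide duplicates.
        key = tuple(str(record[field]).strip().lower() for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


# Hypothetical event records; the second one duplicates the first.
events = [
    {"user": "alice", "email": "Alice@example.com", "action": "login"},
    {"user": "alice", "email": "alice@example.com ", "action": "login"},
    {"user": "bob", "email": "bob@example.com", "action": "login"},
]

unique_events = deduplicate(events, key_fields=("email", "action"))
print(len(unique_events))
```

Which fields identify a duplicate is a design decision that depends on the data; here the email address and the action are treated as the identity of an event.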

A Data Engineer is more experienced with core programming concepts and algorithms. The role of a data engineer closely resembles that of a software engineer, because a data engineer is assigned to develop platforms and architecture that follow the guidelines of software development. For example, developing a cloud infrastructure to facilitate real-time analysis of data requires various development principles. Building an API interface is therefore also one of the responsibilities of a data engineer.

Furthermore, a data engineer has a good knowledge of engineering and testing tools. It is up to the data engineer to manage the entire pipeline architecture: handling logged errors, agile testing, building fault-tolerant pipelines, administering databases, and keeping the pipeline stable.
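One common ingredient of a fault-tolerant pipeline is retrying a failed step and setting aside records that keep failing, so that a single bad record does not stop the whole run. The sketch below shows that idea under simple assumptions; `run_with_retries`, `process_all` and `parse_amount` are hypothetical names, and the retry counts and delays are arbitrary.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(step, record, max_attempts=3, base_delay=0.01):
    """Run one pipeline step, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(record)
        except Exception as exc:
            log.warning("attempt %d/%d failed for %r: %s",
                        attempt, max_attempts, record, exc)
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            time.sleep(base_delay * 2 ** (attempt - 1))


def process_all(step, records):
    """Process every record; persistent failures go to a dead-letter list."""
    results, dead_letters = [], []
    for record in records:
        try:
            results.append(run_with_retries(step, record))
        except Exception:
            dead_letters.append(record)
    return results, dead_letters


def parse_amount(text):
    """Example step: convert a text field to a number."""
    return float(text)


results, dead_letters = process_all(parse_amount, ["10.5", "oops", "3"])
print(results, dead_letters)
```

The record that cannot be parsed ends up in `dead_letters` after three logged attempts, while the other records are processed normally.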

Data Scientist

Nowadays, many companies are looking for data scientists to improve their performance and optimize their production.

There is a massive explosion in data, driven by advances in computational technologies such as high-performance computing. This has given industries a huge opportunity to unearth meaningful information from that data.

Companies collect data in order to analyze it and gain insight into various trends and practices. To do so, they employ specialized data scientists who combine knowledge of statistical tools with programming skills and an understanding of machine learning algorithms. These algorithms are used to predict future events. Data science can therefore be thought of as an ocean that includes all data operations, such as data extraction, data processing, data analysis, and data prediction, used to gain the necessary insights.
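As a toy illustration of prediction, the sketch below fits a straight line to a handful of points with ordinary least squares and uses it to forecast the next value. The monthly sales figures are invented, and a straight line stands in for the far richer machine learning models a data scientist would actually evaluate.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [110.0, 120.0, 130.0, 140.0, 150.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(round(forecast, 2))
```

Because the invented data grows by exactly 10 per month, the fitted line has a slope of 10 and the forecast for month 6 is 160.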

However, data science is not a single, isolated field. It is a quantitative discipline that draws on math, statistics, and computer programming. With the help of data science, industries are equipped to make careful data-driven decisions.

The skills mentioned above can be summarized in the table below:

Skill                                  Data Analyst   Data Engineer   Data Scientist
Calculus and Linear Algebra            *              *               ***
Data Intuition                         **             **              ***
Data Visualization and Communication   ***            **              ***
Data Wrangling                         *              ***             ***
Machine Learning                       *              *               ***
Programming Tools                      ***            ***             *
Software Engineering                   *              ***             **
Statistics                             **             **              ***

Big Data programming language comparison

There is a plethora of programming languages today used for a variety of purposes.

We have compared a few of them on different aspects to make the decision easier.

Languages compared: Scala, Python, R, Java, Go, Julia

Criteria:
Speed
Ease of use
Quick learning curve
Data analysis capability
General-purpose
Big Data support
Interfacing with other languages
Production-ready

A much more detailed list of pros and cons can be found below.

Python

Advantages

Disadvantages

R / Programming with Big Data in R (pbdR)

Advantages

Disadvantages

Java

Advantages

SQL

Retrieving data
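As a small illustration of retrieving data with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table and its contents are invented for the example; the same kind of `SELECT ... GROUP BY` query would be written against a real warehouse or database.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# Retrieve the total spent per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```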

Julia

Advantages

Disadvantages

Scala

Advantages

Disadvantages

MATLAB

Advantages

TensorFlow

Advantages

Go

Advantages

AWS for a big data project

Before analyzing the data and making it useful, we need to set up the infrastructure. Setting up and managing data lakes involves many manual, time-consuming tasks such as loading, transforming, securing, and auditing access to data. AWS Lake Formation is designed to automate many of those manual steps, and AWS says it can reduce the time required to build a successful data lake from months to days.

Some of the available AWS Services are:

Use cases for AWS services in Big Data

Data Warehousing

Run SQL and complex analytic queries against structured and unstructured data in your data warehouse and data lake, without unnecessary data movement.

Big data processing

Quickly and easily process vast amounts of data in your data lake or on-premises for data engineering, data science development, and collaboration.

Real-time analytics

Collect, process, and analyze streaming data, and load data streams directly into your data lakes, data stores, and analytics services so you can respond in real time.
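Independent of any particular AWS service, the core idea of analyzing a stream as it arrives can be sketched as a sliding-window aggregate. The `SlidingWindowAverage` class below is a hypothetical helper written for this illustration, and it assumes that events arrive in timestamp order; managed streaming services deal with ordering, scaling and durability for you.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record a new event and return the current window average."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


# Hypothetical stream of (timestamp in seconds, sensor reading).
window = SlidingWindowAverage(window_seconds=10)
averages = [window.add(ts, value) for ts, value in [(0, 10.0), (5, 20.0), (12, 30.0)]]
print(averages)
```

Each new event updates the running total in constant time, so the average is available immediately instead of after a batch job over the full history.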

Operational analytics

Search, explore, filter, aggregate, and visualize your data in near real time for application monitoring, log analytics, and clickstream analytics.
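For log analytics, the filter and aggregate steps can look like the following sketch. The log format (timestamp, level, service, message separated by spaces) and the sample lines are assumptions made for the example, not the format of any specific AWS service.

```python
from collections import Counter

# Hypothetical application log lines.
log_lines = [
    "2024-01-01T10:00:00 INFO checkout ok",
    "2024-01-01T10:00:01 ERROR checkout payment declined",
    "2024-01-01T10:00:02 INFO search ok",
    "2024-01-01T10:00:03 ERROR search timeout",
    "2024-01-01T10:00:04 ERROR checkout timeout",
]


def parse(line):
    """Split a line into its four fields; the message may contain spaces."""
    timestamp, level, service, message = line.split(" ", 3)
    return {"timestamp": timestamp, "level": level,
            "service": service, "message": message}


entries = [parse(line) for line in log_lines]

# Filter to errors, then aggregate by service.
errors_by_service = Counter(
    entry["service"] for entry in entries if entry["level"] == "ERROR"
)
print(errors_by_service.most_common())
```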

Apart from AWS services, there is the choice of language. Whether it is a trendy language like Python or a more conventional one like Java or R, choosing the right programming language for big data really comes down to your own and your business's preferences.

When starting out, it can be helpful to take advantage of books and other free resources. Doing so allows beginners to become more familiar with the terminology and to build a strong foundation for future development. Those who are looking to make a more streamlined move into the field, however, should look for opportunities to gain and practice the skills needed to become an expert data analyst.

One of the most efficient ways to do this is through the numerous short-term and long-term online courses.