What is Datalake? Know everything here

Who I am
Elia Tabuenca García
@eliatabuencagarcia
Author and references
by Team AllYourVideogames | Jan 18, 2022 | Technology


Do you know what a data lake is? Companies are adopting this type of technology to improve data security and organize their information better. It is gaining ground because conventional storage tools often cannot deliver the agility and flexibility a business needs to produce insights from a volume of data that keeps growing.



What is a data lake?

The term data lake was coined by James Dixon, Chief Technology Officer (CTO) of Pentaho, an open source business intelligence platform.

The word lake fits well because this technology collects data in its natural state: data flows in from various sources and is stored in its original form.

In other words, a data lake is a repository that holds large and varied sets of information in their native format, so users work with an unrefined version of the data. It is increasingly used by businesses that need a vast repository to store data.

One of the data lake's strengths is that all data is retained: nothing is stripped or filtered before storage. The data can be used whenever it is needed, or never used at all, although some care is required, as detailed later. It can also be queried for many different purposes, which is harder when data has already been refined for one specific purpose.



In a data lake, information is only transformed when it is pulled out for analysis, by applying a schema at that moment. This approach is called “schema on read”, because the data stays raw until it is ready to be used.
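The idea of applying a schema only at read time can be sketched in a few lines. This is a minimal illustration, assuming the raw records are stored as JSON lines; the sample data, the field names and the `apply_schema` helper are all hypothetical.

```python
import json

# Raw records exactly as they landed in the lake (hypothetical sample data).
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2022-01-18"}',
    '{"user": "luis", "amount": "5", "extra": "kept but unused here"}',
]

# The schema belongs to the reader, not to the storage: field name -> converter.
SALES_SCHEMA = {"user": str, "amount": float}


def apply_schema(line, schema):
    """Parse one raw line and keep only the schema's fields, converted to their types."""
    record = json.loads(line)
    return {field: convert(record[field]) for field, convert in schema.items()}


# The raw lines are never modified; the typed view exists only for this analysis.
rows = [apply_schema(line, SALES_SCHEMA) for line in RAW_LINES]
print(rows)
```

A different analysis could read the same raw lines with a different schema, which is why refining the data before storage is not required.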

Typically, a data lake supports ad hoc insight gathering and reporting. This means people do not have to keep generating analytical reports from another platform or another type of repository: they can apply a schema and automate reproducing a report whenever they need it.

This technology can be very useful depending on your type of business, but it demands recurring maintenance. Without that management, files can turn into electronic waste: inaccessible, heavy, expensive and useless.

Photo: publicity/AWS re:Invent ANT 316


A data lake that has lost its usefulness in this way is called a data swamp.
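One possible form of recurring maintenance is a periodic audit of what is stored. The sketch below is only an illustration under stated assumptions: the catalog structure, the `find_stale` helper and the one-year idle threshold are hypothetical, not part of any specific product.

```python
from datetime import date

# Hypothetical catalog: one entry per dataset stored in the lake.
CATALOG = [
    {"path": "raw/clicks/", "owner": "web-team", "last_read": date(2022, 1, 10)},
    {"path": "raw/old_export/", "owner": None, "last_read": date(2020, 3, 2)},
]


def find_stale(catalog, today, max_idle_days=365):
    """Return paths that have no owner or were not read within max_idle_days."""
    stale = []
    for entry in catalog:
        idle_days = (today - entry["last_read"]).days
        if entry["owner"] is None or idle_days > max_idle_days:
            stale.append(entry["path"])
    return stale


flagged = find_stale(CATALOG, today=date(2022, 1, 18))
print(flagged)
```

Datasets flagged this way are candidates for review, archiving or deletion before they add cost without adding value.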

Characteristics of a data lake

A data lake has some characteristics of its own:

  • Gathers all of the user's data in one place.
  • Receives structured, semi-structured and unstructured data.
  • High performance for data ingestion and access.
  • Low storage cost.
  • Has and enforces security and data protection rules.
  • Separates storage from processing, which allows high performance and good scalability.

When are data lakes useful?

Data lake technology is useful when you need to work with a large amount of data. Typically, a data lake holds volumes on the order of petabytes or exabytes. To give you an idea, an exabyte is equivalent to a billion gigabytes.
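The conversion can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) units, where a gigabyte is 10^9 bytes; the variable names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
PETABYTE = 10**15
EXABYTE = 10**18

gigabytes_per_exabyte = EXABYTE // GIGABYTE
gigabytes_per_petabyte = PETABYTE // GIGABYTE

print(f"{gigabytes_per_exabyte:,} GB in one exabyte")
print(f"{gigabytes_per_petabyte:,} GB in one petabyte")
```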



If you have few data sources, a small amount of data, standardized formats, and everything can easily be accessed and analyzed in a single database, a data lake is very likely unnecessary and oversized for your needs, and it can create needless complexity and an unnecessary investment.

Now, if your business has heavy data storage demands, a data lake can be a welcome tool. As a quick check, if you answer yes to the following questions, your business very likely needs one:

  • Do you need to ingest data streams (clickstreams, for example)?
  • Does the stored data come from multiple sources?
  • Does the data come in different formats?
  • Is the volume of data very large (petabytes, exabytes)?
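The checklist above can be written as a small function. This is only a sketch of the rule of thumb: the function name is hypothetical, and treating one petabyte as the cut-off for "very large" is an assumption made for the example.

```python
PETABYTE = 10**15  # bytes, decimal units; assumed cut-off for "very large"


def data_lake_is_likely_useful(uses_streams, multiple_sources, mixed_formats, volume_bytes):
    """Return True only when all four checklist answers are yes."""
    return all([
        uses_streams,
        multiple_sources,
        mixed_formats,
        volume_bytes >= PETABYTE,
    ])


# A business with clickstreams, many sources, mixed formats and 3 PB of data.
print(data_lake_is_likely_useful(True, True, True, 3 * PETABYTE))
# A small shop with one standardized source and 50 GB of data.
print(data_lake_is_likely_useful(False, False, False, 50 * 10**9))
```

A real decision would weigh these answers rather than require all of them, so the result is a starting point for the study of the tool, not a verdict.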

In any case, before implementing this technology in your business, study it thoroughly to avoid future problems, which can lead to heavy losses.

Data Warehouse and Data Lake

The data warehouse is another well-known data storage technology. However, it is intended for information that has already been processed and standardized, and it requires a larger financial investment. Its main function is to provide a “clean” version of the information, prepared for a specific goal.

A well-known analogy of a bottle and a lake sums up the difference between the two. A data warehouse is like bottled water: it comes from a single source and is prepared for consumption. A data lake is like a lake: a large body of water in its natural state, fed by various sources.



So the two technologies have different purposes, although both store data. Anyone choosing between them needs to understand and analyze their own demands and estimate the amount of data the business will use, in order to decide on the alternative with the best cost-benefit.

In terms of investment, data lake storage costs less than data warehouse storage. Just be careful not to choose the cheapest technology only to find that it does not meet your demands, which generates even more costs: the famous “cheap is expensive”.


Benefits of using a data lake

In summary, the benefits of a data lake are:

  • Large data storage capacity.
  • Compatible with any data format.
  • Accepts data modification at any time.
  • Allows simultaneous access to your data.
  • Offers the data in its raw state, which helps when you need to analyze a problem and work out a solution, even with other platforms.

Why use a data lake in the enterprise?

If you have identified the need for a data lake in your company or business but still doubt whether the investment is worth it, here are some reasons to adopt this tool.

First of all, keep in mind that data feeds decision-making within a company, at least when it is professionally managed, and even more so now that many companies handle a large volume of information.

With this much data and without the right tools, exploring and modeling it becomes humanly impossible.

So, if your company has a need for a data lake, adopting one can be a good choice that makes a real difference when making decisions.

Check out some advantages of this technology.

Greater flexibility in data analysis

The data analysis process does not always start out clearly defined, with the information to be combined ready for use. In such cases, a data lake makes it possible to mine information of various types that can serve as a starting point for future reports.

Data enhancement

One of the data lake's strongest points is precisely that it can store data in its original format, whatever that is, but there are techniques that improve performance and optimize the data. One example is converting your data to the Parquet format.

Parquet is a format that uses columnar storage instead of row-based storage like CSV. To illustrate the benefit: in Apache Spark, for example, checks that take around 12 hours when reading from CSV can be completed within about an hour with Parquet, roughly a twelve-fold reduction in response time.
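The difference between the two layouts can be illustrated without any Parquet library. The sketch below is not Parquet itself, only a plain-Python picture of the columnar idea; the sample data and helper names are hypothetical. In practice, a tool such as Spark would write the real Parquet files.

```python
import csv
import io

# Hypothetical sample data in row-based (CSV) form.
CSV_TEXT = "user,country,amount\nana,ES,19.9\nluis,MX,5.0\nkim,KR,12.5\n"


def total_from_rows(text):
    """Row-based read: every full row is parsed just to use one field."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(float(row["amount"]) for row in reader)


def to_columns(text):
    """Regroup the same data so that each column's values sit together."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = to_columns(CSV_TEXT)
# Columnar read: only the "amount" column is touched; the others are skipped.
total_from_columns = sum(float(value) for value in columns["amount"])

print(total_from_rows(CSV_TEXT), total_from_columns)
```

Both reads give the same total, but the columnar one never looks at `user` or `country`. On wide tables with many rows, skipping unneeded columns is one of the reasons a columnar format can answer queries faster.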

Better management of large volumes of data

Many companies work with volumes of information in the range of terabytes or more. A data lake is a practical way to make sure the company's management has the data it needs to derive valuable insights.


Information security

Once you have decided to use a data lake, you also need to plan your data security tools at the same time. Through settings and platforms specialized in this kind of service, you can ensure that only the people who really need the information can access the files and modify them.
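The principle of granting access only to those who need it can be sketched as a simple permission check. Everything here is hypothetical: the roles, the path prefixes and the `is_allowed` helper are invented for illustration, and a real deployment would rely on the access controls of its storage platform.

```python
# Hypothetical policy: role -> path prefix -> allowed actions.
POLICY = {
    "analyst": {"curated/": {"read"}},
    "engineer": {"raw/": {"read", "write"}, "curated/": {"read", "write"}},
}


def is_allowed(role, path, action):
    """Return True if the role may perform the action on a path under a granted prefix."""
    grants = POLICY.get(role, {})
    return any(
        path.startswith(prefix) and action in actions
        for prefix, actions in grants.items()
    )


print(is_allowed("analyst", "curated/sales.parquet", "read"))
print(is_allowed("analyst", "raw/clicks.json", "read"))
```

Unknown roles and ungranted prefixes are denied by default, so access has to be given explicitly.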

Another point of attention is the ideal level of durability for the information. There are tools that let you manage this, and depending on your decisions the costs can be high or low.

In addition, you need to check how the data is encrypted. You can manage your keys in, for example, AWS Key Management Service (AWS KMS) and use them to encrypt and decrypt your information to increase the security of your data.

Cost of technology

The data lake, in addition to being cheaper than the data warehouse, is simpler to adopt, as it does not need a full architecture to structure the data up front. Because of this, the cost of implementing it in your business can fit your budget.

Points of Attention for Implementing the Data Lake

A data lake offers a virtual space that prioritizes the quantity of storage over the quality of the information.

Because it can gather so much data, care is needed to keep the information from turning into a data swamp, which makes the files useless and can cause huge losses.

Thus, one of the great challenges of deploying a data lake is making it effective for the company, which means keeping it as an important source of information that can be structured for defined purposes.

Look for a quality and reliable service

To make the most of the technology's advantages, research well and find companies that offer this service and fit your demand. Look for partners that unlock the full potential of your data lake and allow it to be integrated with other tools, especially security tools.

Just don't forget to perform recurring maintenance so that your data doesn't become unusable. With all this in mind, draw up your plan of solutions and services for adopting this technology.

