Data Lakes vs. Data Warehouses: What Are The Differences?

on August 18th, 2016
Data technology

In the wide world of technology, it’s often hard to determine when a new, innovative data technology has truly arrived and when IT has just stumbled across a fun little buzzword to pass around for a bit and blog about. Such is the case with the data lake. Some believe it’s just a new spin on the old data warehouse. Others believe it’s a disaster waiting to happen — a repository that speedily devolves into a swamp of data that can never again be retrieved.

In truth, the data lake is a real thing, and it is a viable solution, either in lieu of or in addition to the standard data warehouse. There are differences, though, so don’t be fooled into thinking it’s just a new way to look at the same old data warehouse. Here are the ways that a data lake is different from a data warehouse, and how to tell which you actually need.

Data Lakes Hold All Data

Data warehouses hold specific data sets that are stored statically in a specific, predefined structure. The data is sanitized, organized, and structured before being stored in the data warehouse. Conversely, all data from all data sources can be stored in, and even streamed into, the data lake. That makes the data lake an ideal solution for storing data streaming in from the IoT.
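To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only: the record shapes, the sales table, and the function names load_warehouse and load_lake are all made up for this example, and an in-memory SQLite database stands in for a real warehouse. The warehouse loader keeps only records that fit its predefined table, while the lake loader keeps every record as a raw JSON line.

```python
import json
import sqlite3

# Hypothetical incoming records from three different sources (shapes are invented).
RECORDS = [
    {"source": "crm", "customer_id": 7, "amount": 19.99},
    {"source": "sensor", "device": "thermo-1", "reading_c": 21.5},
    {"source": "web", "url": "/pricing", "user_agent": "Mozilla/5.0"},
]


def load_warehouse(records):
    """Warehouse-style load: only records matching the predefined table are stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer_id INTEGER NOT NULL, amount REAL NOT NULL)"
    )
    stored = 0
    for rec in records:
        # Records that do not fit the fixed structure are left out entirely.
        if "customer_id" in rec and "amount" in rec:
            conn.execute(
                "INSERT INTO sales VALUES (?, ?)",
                (rec["customer_id"], rec["amount"]),
            )
            stored += 1
    conn.commit()
    return conn, stored


def load_lake(records):
    """Lake-style load: every record is kept as-is, one raw JSON line each."""
    return [json.dumps(rec) for rec in records]


warehouse, warehouse_rows = load_warehouse(RECORDS)
lake_lines = load_lake(RECORDS)
print(warehouse_rows, len(lake_lines))
```

In this toy run the warehouse ends up with one row and the lake with three lines; the sensor and Web records survive only in the lake.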

Data Lakes Support All Data Formats

In an organization with numerous disparate systems, all in different formats, the data lake allows you to house all of the data together, while maintaining its original format. For example, say you’re in an enterprise environment that depends on twenty or more different software applications. A data lake allows you to store all of the data from all these systems in a single repository, again, while keeping the original format. That’s important, because it leaves the possibilities for analytics wide open. A data warehouse strips the native formatting, hamstringing the data analytics team when it comes to being creative with their analytics.
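As a rough sketch of what keeping the original format can look like, the Python below models a lake as a dictionary of raw bytes keyed by path. The paths, file contents, and helper names read_csv and read_json are assumptions made for illustration; a real lake would sit on object storage or a distributed file system. Each file is stored untouched and only interpreted at the moment someone reads it.

```python
import csv
import io
import json

# Hypothetical lake: object paths mapped to raw bytes, exactly as the source systems wrote them.
lake = {
    "crm/orders.csv": b"order_id,total\n1,19.99\n2,5.00\n",
    "web/events.json": b'[{"order_id": 1, "page": "/checkout"}]',
}


def read_csv(path):
    """Interpret a stored object as CSV at read time."""
    text = lake[path].decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def read_json(path):
    """Interpret a stored object as JSON at read time."""
    return json.loads(lake[path])


orders = read_csv("crm/orders.csv")
events = read_json("web/events.json")

# Combine the two sources only when an analysis needs it.
totals = {int(row["order_id"]): float(row["total"]) for row in orders}
joined = [{**event, "total": totals[event["order_id"]]} for event in events]
print(joined)
```

Because the stored bytes are never rewritten, a different analysis can later read the same files in a completely different way.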

Data Lakes Enable All Users

Ideally, the data lake is structured so that all users can easily query and retrieve data from the lake. This is different from the traditional data warehouse, which is generally guarded by IT somewhat like the Royal Guard watches over Buckingham Palace. While the typical data warehouse allows access by applications, the data itself is strictly off limits to the users. A data lake allows users to retrieve data, work with it, and even store it back with all the changes, thereby adding to the potential analytical uses for the data.

Data Lakes Are Easier to Change

Data warehouses are notoriously rigid in structure, and making changes is difficult, time-consuming, and frustrating. The data lake is more fluid and agile, allowing changes to be made more quickly and easily, though unfortunately not entirely without frustration.
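One small, hedged illustration of that difference: suppose a new coupon field starts showing up in order data. The orders table and the coupon field are invented for this sketch, and in-memory SQLite again stands in for a warehouse. The warehouse side needs a schema migration before the new field can be stored; the lake side simply accepts the new record shape and lets the reader cope with older records that lack the field.

```python
import json
import sqlite3

# Warehouse side: the table structure is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 19.99)")

# A new attribute requires an explicit schema change before it can be stored.
conn.execute("ALTER TABLE orders ADD COLUMN coupon TEXT")
conn.execute("INSERT INTO orders VALUES (2, 5.00, 'SPRING')")
warehouse_rows = conn.execute(
    "SELECT order_id, coupon FROM orders ORDER BY order_id"
).fetchall()

# Lake side: new records just carry the extra field; nothing is migrated.
lake_lines = [
    json.dumps({"order_id": 1, "total": 19.99}),
    json.dumps({"order_id": 2, "total": 5.00, "coupon": "SPRING"}),
]
# The reader decides how to treat records written before the field existed.
lake_coupons = [json.loads(line).get("coupon") for line in lake_lines]

print(warehouse_rows)
print(lake_coupons)
```

On a single toy table the migration is one statement; the frustration described above comes from doing the same thing across many tables, load jobs, and reports that all depend on the old structure.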

How Can You Decide Between a Data Lake & Data Warehouse?

If you have an established data warehouse and it’s serving the intended purposes nicely, there is no reason to tear it down and try to rebuild Rome. You can add a data lake and use the two together. However, if you’re preparing to build a data repository from scratch, a data lake is ideally suited to today’s data stores, which include large (and growing) quantities of unstructured data. Unstructured data such as files and documents, social data, Web data, and other types of data plays particularly poorly with a traditional relational database. Pick a data lake instead.

To learn more about data analytics software, data lakes, and all things involving data technology, follow us on Twitter.