Characteristics Required for Micro Data Lakes
If you are reading this series for the first time, it is recommended that you read the first article (The Rise of Micro Data Lakes) before this one.
Introduction
If we were to name the most significant characteristic of micro data lakes, it would be that they are "purpose-specific". In other words, while cloud-based data lakes are highly generalized, edge-based solutions provide functionality optimized for a specific business purpose. This follows from the limitations of edge computing. Examples include:
Data lakes for processing large-scale sensor data
Data lakes for processing large-scale unstructured image data
Data lakes for processing large-scale video and visual data
Data lakes for processing large-scale unstructured audio data such as voice and noise
Data lakes for processing and analyzing large-scale network-related data (packets, sessions)
There could be many more types of specialized data lakes beyond these, and in the future, the range of functions provided by data lakes is likely to expand much further.
Hardware
This is quite an ambiguous area, because the gray zone where the realm of edge devices and the realm of servers intersect is so large. As time passes and computing power and functionality improve, hardware with server-grade specs from a few years ago becomes today's edge device. Nevertheless, let's summarize based on today's common notion of what counts as an edge:
CPU
In most cases, x86 and x64 CPUs are the mainstream, typically clocked at 1.5 ~ 2.5 GHz, with core counts ranging from 1 up to 8 or 16 in machines classified as edge servers. Notably, ARM-based CPUs now offer considerable performance at low cost, so they are increasingly being used in edge servers as well.
Memory
Generally, 1 GB to 4 GB or 8 GB of memory is common, and the cost does not differ much across that range. If virtualization is needed, more memory might be installed, but that is arguably beyond the functional scope of an edge server.
Network
Edge servers usually provide network performance of 1 Gbps to 10 Gbps. Given their role, they may also provide two or more ports to separate internal and external networks.
Disk (Storage) Type/Size
Very small edge devices that are not server-grade may come with relatively unreliable MicroSD storage, but edge servers generally provide SSD-based storage of around 256 GB. Since this storage serves the main purpose of a data lake, storing and processing data, users can install an appropriate size depending on what data they will process and for what purpose.
Types of Data
Structured Data Processing
In most cases, the structured data in business areas requiring edge servers consists of large volumes of "sensor" or "tag" data. Other structured data includes network packets, their associated session information, and other continuously generated time-series network quality metrics.
In conclusion, it's fair to say that one of the main purposes of a "micro data lake" is to store and process these large volumes of structured time-series data most effectively.
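As a rough sketch of what this kind of data looks like at the ingest boundary, the record a micro data lake optimizes for is narrow and keyed by time. The field names below are illustrative assumptions, not the schema of any particular product:

```python
import time
from dataclasses import dataclass

# A hypothetical record shape for high-volume sensor/tag data: a fixed,
# narrow schema keyed by time is what makes time-series ingest and
# range queries efficient in a micro data lake.
@dataclass
class SensorRecord:
    ts_ns: int      # event time, nanoseconds since the epoch
    tag: str        # sensor/tag identifier, e.g. "line1.temp03"
    value: float    # the measured value

def make_record(tag: str, value: float) -> SensorRecord:
    return SensorRecord(ts_ns=time.time_ns(), tag=tag, value=value)

if __name__ == "__main__":
    print(make_record("line1.temp03", 72.5))
```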
Unstructured Data Processing
The most representative unstructured data in the edge domain is vision information from cameras. It is most closely linked with AI applications and most widely used to detect abnormal behavior and quickly notify users so that risks and disasters can be prevented.
However, there is room for debate on whether this should use the specific functionality of a "data lake". Such vision data is stored as ordinary files, leaving little room for the fast searching that indexing or compression enable. Collecting vision data becomes meaningful when it is used as raw (tagged) data for AI training.
However, if we imagine a case where such unstructured data does become meaningful at the edge, it is when the data is stored and preserved together with time information. That makes it very useful for analyzing how changes in specific sensors or tags over a given time range relate to actual changes in the "vision (image)" data. In equipment such as a tank, for example, it could be used to analyze how external impacts or flames captured by cameras quantitatively affect internal sensor data, or conversely, how sudden changes captured by internal sensors relate to external visual information.
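One minimal way to realize this, sketched below under my own assumptions (the paths, file layout, and JSON-lines sidecar index are all illustrative), is to keep frames as ordinary files but record each frame's timestamp so that frames can later be joined to sensor data by time range:

```python
import json
import time
from pathlib import Path

# Hypothetical sketch: preserve unstructured frames as ordinary files,
# but log each frame's timestamp in a small sidecar index so the frames
# can later be located by time range.
MEDIA_DIR = Path("frames")
INDEX_PATH = Path("frame_index.jsonl")

def store_frame(jpeg_bytes: bytes) -> None:
    MEDIA_DIR.mkdir(exist_ok=True)
    ts_ns = time.time_ns()
    path = MEDIA_DIR / f"{ts_ns}.jpg"
    path.write_bytes(jpeg_bytes)
    # One JSON line per frame: timestamp -> file location.
    with INDEX_PATH.open("a") as idx:
        idx.write(json.dumps({"ts_ns": ts_ns, "path": str(path)}) + "\n")

def frames_between(start_ns: int, end_ns: int) -> list[str]:
    # Scan the sidecar index for frames inside [start_ns, end_ns].
    hits = []
    with INDEX_PATH.open() as idx:
        for line in idx:
            entry = json.loads(line)
            if start_ns <= entry["ts_ns"] <= end_ns:
                hits.append(entry["path"])
    return hits

if __name__ == "__main__":
    store_frame(b"\xff\xd8 placeholder bytes, not a real JPEG")
    print(frames_between(0, time.time_ns()))
```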
Data Collection Performance
If we only consider raw collection performance, writing directly to plain text files would be the fastest method in the world. However, in a "micro data lake", collection must satisfy all three of the following elements to fulfill its role as a "data lake". Otherwise, it is no different from ordinary disk space!
Data Storage Performance
Since the input data is likely to arrive in unstructured or semi-structured form, it must be quickly converted into the format defined by the "data lake" and rapidly written to physical storage. This is essentially the question "How can we maximize the use of disk I/O bandwidth?", and various methods of using the full bandwidth should be considered.
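One classic method is batching: accumulating small records in memory and flushing them as one large sequential write uses far more of the disk's bandwidth than issuing one tiny write per record. The sketch below is illustrative only (the class name, fixed binary layout, and 4 MB threshold are my assumptions):

```python
import io
import struct

# Illustrative batching writer: many small records are packed into one
# buffer and flushed as a single large sequential write.
class BatchWriter:
    def __init__(self, path: str, flush_bytes: int = 4 * 1024 * 1024):
        self._file = open(path, "ab", buffering=0)
        self._buf = io.BytesIO()
        self._flush_bytes = flush_bytes

    def append(self, ts_ns: int, value: float) -> None:
        # Fixed binary layout: 8-byte timestamp + 8-byte float value.
        self._buf.write(struct.pack("<qd", ts_ns, value))
        if self._buf.tell() >= self._flush_bytes:
            self.flush()

    def flush(self) -> None:
        self._file.write(self._buf.getvalue())  # one big sequential write
        self._buf = io.BytesIO()

    def close(self) -> None:
        self.flush()
        self._file.close()
```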
Data Indexing Performance
This is a very important element. Indexing should occur in real time, the moment data is stored. A representative big data solution that does not do this is "Hadoop", which is built in separate stages: storage (writing plain text to disk) followed by index generation or analysis (typically MapReduce). Its real-time performance is consequently very poor.
Given that edge computing assumes everything after data generation happens in "real time", big data solutions like "Hadoop" seem impossible to utilize here, and even if deployed they would be too inefficient for the purpose. Another example is dropping the indexes of a general RDBMS to maximize ingest performance, but this too should be seen as diverging from the original purpose.
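To contrast with the staged approach, here is a minimal sketch of an index that is updated synchronously with each ingested record. In-memory lists stand in for on-disk storage, and timestamps are assumed to arrive in increasing order; all names are illustrative:

```python
import bisect

# Sketch: maintain a sparse time index *while* ingesting, instead of a
# separate post-hoc indexing stage as in Hadoop-style pipelines.
class TimeIndexedLog:
    SPARSE_EVERY = 1000  # remember one (timestamp, offset) pair per N records

    def __init__(self):
        self.records: list[tuple[int, float]] = []  # stands in for on-disk data
        self.index_ts: list[int] = []                # sparse index: timestamps...
        self.index_off: list[int] = []               # ...and matching offsets

    def ingest(self, ts_ns: int, value: float) -> None:
        # Assumes monotonically increasing event time.
        offset = len(self.records)
        if offset % self.SPARSE_EVERY == 0:
            self.index_ts.append(ts_ns)              # index updated in real time
            self.index_off.append(offset)
        self.records.append((ts_ns, value))

    def seek(self, ts_ns: int) -> int:
        # Binary-search the sparse index for a starting offset.
        i = bisect.bisect_right(self.index_ts, ts_ns) - 1
        return self.index_off[max(i, 0)]
```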
Data Compression Performance
This is also very important. Due to the nature of edge servers, once installed it is very difficult to change the hardware storage configuration dynamically (and almost impossible once data has been loaded). The server is physically distant and lacks the flexibility (elasticity) advantage of cloud computing. Therefore, to maximize the use of limited storage resources, it is essential that data is compressed immediately upon collection and remains usable in its compressed state.
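A common pattern for "usable while compressed" is block compression with per-block metadata: queries decompress only the blocks whose time range overlaps the request. The sketch below is an assumption-laden illustration (zlib, 1000-record blocks, in-memory storage), not any product's format:

```python
import struct
import zlib

# Sketch: compress data in fixed blocks as it is collected, keeping
# per-block metadata (time range) so a query can decompress only the
# blocks it needs -- the data stays usable in its compressed state.
class CompressedBlockStore:
    BLOCK_RECORDS = 1000

    def __init__(self):
        self.blocks: list[bytes] = []             # compressed payloads
        self.meta: list[tuple[int, int]] = []     # (min_ts, max_ts) per block
        self._pending: list[tuple[int, float]] = []

    def append(self, ts_ns: int, value: float) -> None:
        self._pending.append((ts_ns, value))
        if len(self._pending) >= self.BLOCK_RECORDS:
            self._seal_block()

    def _seal_block(self) -> None:
        raw = b"".join(struct.pack("<qd", ts, v) for ts, v in self._pending)
        self.blocks.append(zlib.compress(raw))    # compressed at collection time
        self.meta.append((self._pending[0][0], self._pending[-1][0]))
        self._pending.clear()

    def read_range(self, start_ns: int, end_ns: int):
        # Decompress only blocks whose time range overlaps the query.
        for (lo, hi), blob in zip(self.meta, self.blocks):
            if hi < start_ns or lo > end_ns:
                continue
            raw = zlib.decompress(blob)
            for off in range(0, len(raw), 16):
                ts, v = struct.unpack_from("<qd", raw, off)
                if start_ns <= ts <= end_ns:
                    yield ts, v
```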
If any of these three characteristics are lacking, it can be said to fail as a "micro data lake".
Indeed, it's not easy.
Data Manipulation Performance
What do users want to do through this data lake?
Real-time Data Search/Extraction Performance
It must satisfy the most important and fundamental requirement: search performance. Since edge computing exists to compensate for the slow response time of cloud computing, the time between data generation and data extraction should be minimized as far as possible. This is closely tied to the data indexing performance discussed above, and index-based extraction only becomes meaningful once collection performance has been maximized.
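The essence of the requirement is that a record becomes queryable the moment it is ingested, with no index-build step in between. A tiny self-contained sketch (sorted in-memory lists standing in for the real storage and index):

```python
import bisect
import time

# Sketch: ingest keeps timestamps sorted, so a range extraction is a
# pair of binary searches -- nothing sits between generation and query.
timestamps: list[int] = []
values: list[float] = []

def ingest(ts_ns: int, value: float) -> None:
    timestamps.append(ts_ns)   # assumes monotonically increasing event time
    values.append(value)

def extract(start_ns: int, end_ns: int) -> list[float]:
    lo = bisect.bisect_left(timestamps, start_ns)
    hi = bisect.bisect_right(timestamps, end_ns)
    return values[lo:hi]

t0 = time.time_ns()
ingest(t0, 42.0)
print(extract(t0 - 1, t0 + 1))  # the record is visible immediately
print(f"generation-to-extraction: {time.time_ns() - t0} ns")
```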
Time Series Data Analysis Performance
This is closely related to the processing of unstructured data such as images or video mentioned earlier. If a specific event occurred at a specific time, it should be possible to immediately access the related data (images, sound, noise, and so on) for that period. This comes down to how the data lake interconnects the two data types. For example, it would be an essential function where large-scale sensor data and video (image) data are generated side by side, as in autonomous vehicles.
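The read side of that interconnection can be as simple as a shared time axis: given an anomalous sensor event, pull every media artifact recorded in a window around it. The sample data, threshold, and window below are entirely hypothetical:

```python
# Sketch: sensor events and media files share one time axis, so an
# event timestamp directly selects the related media.
WINDOW_NS = 2_000_000_000  # +/- 2 seconds around the event (illustrative)

sensor_events = [(1_000, "vibration", 9.8), (5_000_000_000, "vibration", 31.2)]
media_index = [(900, "frames/000900.jpg"), (4_800_000_000, "frames/4800.jpg")]

def media_around(event_ts: int) -> list[str]:
    lo, hi = event_ts - WINDOW_NS, event_ts + WINDOW_NS
    return [path for ts, path in media_index if lo <= ts <= hi]

for ts, kind, value in sensor_events:
    if value > 30.0:  # hypothetical anomaly threshold
        print(f"{kind}={value} at {ts} -> {media_around(ts)}")
```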
Statistical Data Extraction Performance
A micro data lake on an edge server should be considered nearly incapable of executing "long-running analytic queries". Edge servers must remain in service 24/7, 365 days a year, and analytical queries that consume excessive resources and disrupt the running service would be a serious problem. In other words, the "micro data lake" must have an answer for how to quickly provide the statistical/analytical data that users need.
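One common answer is to maintain rollups incrementally at ingest time, so that a statistics request is a cheap lookup rather than a long scan competing with the live service. A minimal sketch, with the one-minute bucket size as an illustrative choice:

```python
from collections import defaultdict

# Sketch: per-bucket statistics are updated as each record arrives,
# so "give me the stats" never triggers a long-running scan.
BUCKET_NS = 60_000_000_000  # 1-minute buckets (illustrative)

rollups = defaultdict(lambda: {"count": 0, "sum": 0.0,
                               "min": float("inf"), "max": float("-inf")})

def ingest(tag: str, ts_ns: int, value: float) -> None:
    b = rollups[(tag, ts_ns // BUCKET_NS)]
    b["count"] += 1
    b["sum"] += value
    b["min"] = min(b["min"], value)
    b["max"] = max(b["max"], value)

def minute_avg(tag: str, ts_ns: int) -> float:
    b = rollups[(tag, ts_ns // BUCKET_NS)]
    return b["sum"] / b["count"] if b["count"] else 0.0
```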
On the other hand, cloud-based data lakes are much freer in this respect, because they are mostly configured for "analysis" rather than "real-time service".
In this respect, the "micro data lake" should be seen as having the more demanding characteristics.
Data Movement
Data collected at the edge is destined to be moved somewhere periodically, whether in part or in whole, because the size of its storage is limited. Of course, data can simply be deleted, but at a minimum, alarms, events, and other highly important records must be moved to the cloud or an upper layer. Therefore, when such data must be shipped externally, the load of searching, extracting, and replicating it over the network and storage should be minimized so that the service is not affected.
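One way to keep that load small is to ship only flagged records, packed into compressed batches. The selection rule, record fields, and staging step below are all my illustrative assumptions; the actual transport (MQTT, HTTP, etc.) is left out:

```python
import gzip
import json

# Sketch: select only important records (alarms, events) and compress
# them into one batch before they leave the edge.
def select_for_upload(records: list[dict]) -> list[dict]:
    return [r for r in records if r.get("severity", 0) >= 3]  # hypothetical rule

def pack_batch(records: list[dict]) -> bytes:
    payload = "\n".join(json.dumps(r) for r in records).encode()
    return gzip.compress(payload)  # shrink the batch before transfer

records = [
    {"ts": 1, "tag": "temp03", "value": 72.5, "severity": 1},
    {"ts": 2, "tag": "temp03", "value": 240.1, "severity": 4},  # alarm
]
batch = pack_batch(select_for_upload(records))
print(f"{len(batch)} bytes staged for upstream transfer")
```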
Data Security
This issue can be viewed from three main perspectives, so let's describe each in detail.
Data Confidentiality (Encryption)
This concerns how to keep information safe if the edge device is stolen or the storage medium leaks externally. So far, in most cases, there have been no requirements for encrypting data on edge servers, perhaps because sensor or equipment data is thought to fall outside the scope of "personal information protection". Unless a government-mandated encryption standard must be enforced, a wide variety of encryption/decryption techniques are available, so implementation should pose no real difficulty. In practice, cases requiring encryption on edge servers are very rare.
Security of Data Transmission (Communication Medium Encryption)
Generally, communication between servers and edges can be assumed to occur mostly over MQTT, and since this communication mostly uses standard encryption technology that we already know, there should be no major difficulty in implementation. If a special communication method other than MQTT is used, separate consideration would be necessary.
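For concreteness, here is a minimal TLS-protected publish using the Eclipse paho-mqtt client (1.x API). The broker host, port, credentials, CA certificate path, and topic are all placeholder assumptions:

```python
import paho.mqtt.client as mqtt  # pip install "paho-mqtt<2"

# Sketch: MQTT over TLS with the standard paho-mqtt client.
client = mqtt.Client(client_id="edge-01")
client.tls_set(ca_certs="ca.crt")           # verify the broker against this CA
client.username_pw_set("edge-01", "secret")  # placeholder credentials
client.connect("broker.example.com", 8883)   # 8883 = conventional MQTT/TLS port
client.publish("plant/line1/temp03", '{"ts": 1, "value": 72.5}', qos=1)
client.disconnect()
```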
Physical Location of Storage Space
Generally, another reason for storing data on edge servers is that some companies do not want their actual data transmitted to the cloud. If a "micro data lake" cannot be built, they must install a private cloud or collect the data on their own servers in a separate machine room, so that the physical storage location of the data stays within the company's boundaries. Conversely, if a "micro data lake" is actually feasible, it also means it is quite safe, from a data security perspective, to use a public cloud for the server responsible for management and control.
Scalability
Currently, scale-up is likely the only option for edge server scalability. At some point in the future, scale-out may be needed even for edge servers, but in the near term it is expected to remain mostly limited to scale-up.
High Availability
This is the biggest concern and the most difficult part of edge computing. By its nature, the edge is bound to be a "single point of failure", yet in certain continuous-process industries this is not allowed. Edge servers must therefore be duplicated as well, and implementation should weigh the cost-effectiveness of duplicating only the computing layer against duplicating the storage too. (So far, I have not seen a case built to this level at the edge.)
Cloud and Edge Integration and Service
This topic also carries important issues that are easy to overlook in practice.
Let's assume that only some of the data is in the cloud (or on the server) and most of it is stored in the micro data lake at the edge. If a user remotely accesses this micro data lake's data for various purposes (monitoring, auditing, history checks and event searching, visualization), how can we make those requests possible? Most edge servers sit inside a specific company's internal network, so connecting to each edge server from the cloud or an external network for arbitrary purposes requires devising special methods.
If this is possible, data can be manipulated very transparently from the cloud (server) side. From the end user's perspective, an innovative data integration model becomes possible: the data is abstracted so that users cannot tell whether what they are viewing lives in the cloud or at the edge, and they can manipulate and visualize data on both sides without it ever leaking to an external network.
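The routing idea behind such transparency can be sketched as a dispatcher that splits one query by time range between the cloud tier and the edge tier and merges the results. Everything below (the retention boundary, the stand-in query functions) is an illustrative assumption:

```python
# Sketch: one query against an abstract catalog; a router decides which
# tier holds each slice of the requested time range.
CLOUD_RETENTION_START_NS = 1_000_000_000_000  # data older than this was moved up

def query_cloud(tag, start, end):   # stand-in for a cloud-side query API
    return [("cloud", tag, start)]

def query_edge(tag, start, end):    # stand-in for a tunnel/relay to the edge node
    return [("edge", tag, end)]

def query(tag: str, start_ns: int, end_ns: int):
    results = []
    if start_ns < CLOUD_RETENTION_START_NS:   # older slice lives in the cloud tier
        results += query_cloud(tag, start_ns, min(end_ns, CLOUD_RETENTION_START_NS))
    if end_ns >= CLOUD_RETENTION_START_NS:    # recent slice lives at the edge
        results += query_edge(tag, max(start_ns, CLOUD_RETENTION_START_NS), end_ns)
    return results  # the caller never needs to know which side answered

print(query("temp03", 999_999_999_000, 1_000_000_001_000))
```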
Isn't this architecture the completed form of a true "micro data lake"?