“Mach Speed Horizontally Scalable Time Series Database.”
Semiconductor Production Data using MACHBASE (Time-Series Database)
- Problems with data processing
- Definition of Target Data for Search
- Data Preprocessing for Machbase TAG Table
- Data Retrieval
Semiconductor production data combines readings from the various sensors attached to manufacturing equipment with details about the chips being produced. The volume is substantial: depending on how often the sensors transmit, the production data for a single semiconductor wafer can reach several gigabytes.
Semiconductor production data is recorded per production item, as described above. It is typically converted into XML files and stored for analysis in a distributed big data processing system such as Hadoop. Alternatively, it is stored as XML-formatted BLOB columns in a relational database management system (RDBMS).
In practice, each sensor's readings are embedded inside the per-wafer XML files, so following one sensor's values over time means opening and parsing many wafer files. This makes the layout unsuitable for examining sensor trends.
Recording sensor data per production item, as described in the introduction, has run into several problems in practice. These problems are common when processing large volumes of sensor data, and we review them here.
· With Hadoop, handling massive and rapid data input is not a problem. Reading sensor data back for statistical analysis, however, takes too long for real-time analysis to be feasible. Tools such as Spark can improve search performance, but they require a large-scale cluster, which means significant hardware investment.
· An RDBMS suffers from slow data input, so it cannot keep up with the required data volume. Certain appliances improve input speed to some extent, but they are costly and still fall short of a satisfactory level.
To address these challenges, we have recently been experimenting with “Machbase”, a database specialized for time-series sensor data. In this post, we explore how to structure the data in Machbase to achieve fast data input and to support the searches users want to perform.
The Machbase Tag table efficiently stores sensor values by time, allowing swift searches by sensor tag identifier and time range. This holds even for very large volumes of sensor data. However, the Tag table on its own is limited when it comes to storing and retrieving production information, such as which wafer the equipment was producing at a specific time and which recipe was in use at that moment. An example of the search interface desired by customers is as follows:
In this illustration, the customer wants to visualize sensor data or display statistical values for each sensor without specifying a particular sensor tag or time range. Instead, they select the equipment module where the sensors are installed and the lot number or wafer number of the produced wafers. To make this kind of retrieval fast, the wafer production data must be preprocessed before it is loaded.
As mentioned earlier, semiconductor production data is recorded in a manner where sensor data is dependent on specific wafers or lots. However, for better performance in Machbase, it is beneficial to separate sensor data from process data. Therefore, the data needs to be transformed into the following format.
In the figure above, for each production lot, the times at which it enters and exits each machine are recorded together with the corresponding lot number. This information is stored in the table “Lot_eq_start_end”. We define this table as process data (PROCESS_DATA).
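The separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not show the real wafer XML layout, so the element and attribute names (`wafer`, `sensor`, `sample`, `lot`, `equipment`, `tag`, `time`, `value`) and the function `split_wafer_xml` are hypothetical. The sketch also assumes the start and end times can be approximated by the first and last sample timestamps in one file, whereas a real lot interval would span all wafers of the lot.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical wafer document; the real file layout is not given in the post.
WAFER_XML = """
<wafer lot="LOT001" wafer_id="W01" equipment="EQ_ETCH_1">
  <sensor tag="EQ_ETCH_1.TEMP">
    <sample time="2023-01-01T00:00:00" value="351.2"/>
    <sample time="2023-01-01T00:00:01" value="351.9"/>
  </sensor>
  <sensor tag="EQ_ETCH_1.PRESSURE">
    <sample time="2023-01-01T00:00:00" value="1.02"/>
    <sample time="2023-01-01T00:00:02" value="1.05"/>
  </sensor>
</wafer>
"""


def split_wafer_xml(xml_text):
    """Split one wafer document into sensor rows and one process-data row."""
    root = ET.fromstring(xml_text.strip())

    # Sensor data: one (tag, time, value) row per sample, independent of the wafer.
    sensor_rows = []
    for sensor in root.iter("sensor"):
        tag = sensor.get("tag")
        for sample in sensor.iter("sample"):
            ts = datetime.fromisoformat(sample.get("time"))
            sensor_rows.append((tag, ts, float(sample.get("value"))))
    sensor_rows.sort(key=lambda row: (row[1], row[0]))  # order by time, then tag

    # Process data: which lot was on which equipment, and when (assumed from samples).
    times = [row[1] for row in sensor_rows]
    process_row = {
        "lot": root.get("lot"),
        "equipment": root.get("equipment"),
        "start_time": min(times),
        "end_time": max(times),
    }
    return sensor_rows, process_row


rows, process = split_wafer_xml(WAFER_XML)
```

The sensor rows no longer carry any lot or wafer information; that link is kept only in the process-data row, which is what allows the two kinds of data to be stored and indexed separately.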
Since each sensor is installed on a specific piece of equipment, the tag identifier of each sensor can be looked up from the production equipment it belongs to. This equipment-to-tag mapping is treated as metadata. An E-R diagram representation of these relationships is shown below.
Now that we have metadata for looking up sensor tag IDs by equipment and process data for the manufactured products, the desired result can be obtained by joining the three tables together: the metadata, the process data, and the Tag table holding the sensor values.
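As a rough sketch of that join, the example below uses Python's built-in `sqlite3` module as a stand-in, not Machbase itself. Apart from the table name “Lot_eq_start_end”, every table name, column name, and sample value here is an assumption made for illustration, as is the helper `sensor_values_for_lot`; Machbase's own DDL and query syntax may differ.

```python
import sqlite3

# SQLite is used only as a stand-in to show the shape of the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag_meta (tag_id TEXT PRIMARY KEY, equipment TEXT);        -- metadata
CREATE TABLE Lot_eq_start_end (lot TEXT, equipment TEXT,
                               start_time TEXT, end_time TEXT);         -- process data
CREATE TABLE tag_data (tag_id TEXT, time TEXT, value REAL);             -- sensor values
""")
conn.executemany("INSERT INTO tag_meta VALUES (?, ?)", [
    ("EQ1.TEMP", "EQ1"), ("EQ1.PRESSURE", "EQ1"), ("EQ2.TEMP", "EQ2"),
])
conn.executemany("INSERT INTO Lot_eq_start_end VALUES (?, ?, ?, ?)", [
    ("LOT001", "EQ1", "2023-01-01 00:00:00", "2023-01-01 00:00:10"),
    ("LOT002", "EQ1", "2023-01-01 00:00:20", "2023-01-01 00:00:30"),
    ("LOT001", "EQ2", "2023-01-01 00:00:20", "2023-01-01 00:00:30"),
])
conn.executemany("INSERT INTO tag_data VALUES (?, ?, ?)", [
    ("EQ1.TEMP", "2023-01-01 00:00:05", 350.0),
    ("EQ1.PRESSURE", "2023-01-01 00:00:07", 1.01),
    ("EQ1.TEMP", "2023-01-01 00:00:25", 352.0),
    ("EQ2.TEMP", "2023-01-01 00:00:05", 80.0),
    ("EQ2.TEMP", "2023-01-01 00:00:25", 81.0),
])


def sensor_values_for_lot(conn, lot, equipment):
    """Return sensor rows recorded on `equipment` while `lot` was being processed."""
    query = """
        SELECT d.tag_id, d.time, d.value
        FROM Lot_eq_start_end AS p
        JOIN tag_meta AS m ON m.equipment = p.equipment   -- equipment -> tag ids
        JOIN tag_data AS d ON d.tag_id = m.tag_id         -- tag ids -> sensor values
                          AND d.time BETWEEN p.start_time AND p.end_time
        WHERE p.lot = ? AND p.equipment = ?
        ORDER BY d.time, d.tag_id
    """
    return conn.execute(query, (lot, equipment)).fetchall()


result = sensor_values_for_lot(conn, "LOT001", "EQ1")
```

The user supplies only a lot and an equipment module. The process data turns that into a time range, the metadata turns the equipment into tag identifiers, and the final lookup is then the tag-and-time-range search that the Tag table is designed for.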
In Machbase, when searching by specific equipment and lot criteria across 1 million tags and a total of 10 billion records, we can retrieve 900,000 result rows in less than 0.1 seconds. This was observed on hardware with 4 cores and 32GB SSD. Furthermore, we confirmed that consistent performance can be maintained even when the total volume of data is scaled up.
We have examined the limitations of using Hadoop and RDBMS for processing semiconductor production data and the solutions provided by a time-series DBMS. Machbase's time-series DBMS offers high-speed data input, retrieval, and analysis, overcoming the constraints of conventional data processing systems in semiconductor production. We have also explored the data preprocessing and retrieval methods needed when applying Machbase.
This approach holds value not only for semiconductor production but also for various other industries. If there are additional technical details or advancements in the future, I will address them in separate posts.
Kwanghoon Shim, CRO of Machbase