Multi-Thread Processing
Overview of Multi-Thread Processing
Multi-Thread Processing is a mechanism that divides data and processes the divided data in parallel using multiple threads to achieve high-speed performance.
The number of threads is automatically determined from multiple conditions such as the size of the data to be read and the number of CPU cores.
To perform Multi-Thread Processing, the components must support Multi-Thread Processing.
For the components that support Multi-Thread Processing, refer to Components that support Multi-Thread Processing.
If the input source or output destination of a component that performs Multi-Thread Processing doesn't support Multi-Thread Processing, the input data and result data are delivered serially. Even though the processing within the component is performed in parallel, this serial delivery becomes a bottleneck that limits performance.
By specifying components that support Multi-Thread Processing as the input source and output destination, you can take full advantage of Multi-Thread Processing and obtain the maximum performance. Multi-Thread Processing is most effective when every step in a series of processing, such as read, convert, and write, supports Multi-Thread Processing.
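The exact rules used to determine the number of threads are not described here. As a rough illustration only, the following Python sketch assumes a hypothetical heuristic: one thread per fixed amount of data, capped by the number of CPU cores. The function name `choose_thread_count` and the `target_chunk_bytes` parameter are assumptions made for this example and are not part of the product.

```python
import os


def choose_thread_count(data_size_bytes, cpu_cores=None, target_chunk_bytes=1_000_000):
    """Hypothetical heuristic, not the product's actual rule.

    Uses one thread per `target_chunk_bytes` of input, never more threads
    than CPU cores, and always at least one thread.
    """
    if cpu_cores is None:
        # os.cpu_count() may return None, so fall back to a single core.
        cpu_cores = os.cpu_count() or 1
    # Ceiling division: number of chunks needed to cover the data.
    chunks = -(-data_size_bytes // target_chunk_bytes)
    return max(1, min(cpu_cores, chunks))
```

With this assumed heuristic, small inputs stay on a single thread and large inputs are limited by the core count, which matches the general idea that both data size and CPU cores influence the result.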
Multi-Thread Processing architecture
Processing flow
Multi-Thread Processing operates with the following flow.
In the following explanation, the component that hands over the result data is component A and the component that receives the result data of component A is component B.
When component A and component B support Multi-Thread Processing
- Component A divides the data according to the size of the data to be read and the number of CPU cores.
- Component B generates result data while reading the data divided by component A in multiple threads.
- Component B outputs the result data of the multiple threads all at once.
When only component A supports Multi-Thread Processing
- Component A divides the data according to the size of the data to be read and the number of CPU cores.
- Component A hands off the result data of the multiple threads all at once to component B.
- Component B receives the data from component A and generates result data.
- Component B outputs the result data.
When only component B supports Multi-Thread Processing
- Component A reads all the data.
- Component A hands off the read data to component B.
- Component B receives the data from component A, divides it based on the data size and the number of CPU cores, and generates the result data using multiple threads.
- Component B outputs the result data of the multiple threads all at once.
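The three flows above can be sketched in Python as follows. This is a simplified model, not the product's implementation: `read` and `convert` are hypothetical stand-ins for the work of component A and component B, and `split`, `parallel`, and `flatten` are helper names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts):
    """Divide rows into at most `parts` contiguous chunks."""
    if not rows:
        return []
    parts = max(1, min(parts, len(rows)))
    size = -(-len(rows) // parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def read(chunk):
    """Stand-in for component A's work: trim each raw value."""
    return [value.strip() for value in chunk]


def convert(chunk):
    """Stand-in for component B's work: upper-case each value."""
    return [value.upper() for value in chunk]


def parallel(work, rows, threads):
    """Divide rows and run `work` on each chunk in its own thread."""
    with ThreadPoolExecutor(max_workers=max(1, threads)) as pool:
        # pool.map returns results in chunk order; the product does not
        # promise ordering for every component.
        return list(pool.map(work, split(rows, threads)))


def flatten(chunks):
    return [row for chunk in chunks for row in chunk]


def both_support(raw, threads):
    # A divides the data; each thread reads and converts its own chunk.
    return flatten(parallel(lambda chunk: convert(read(chunk)), raw, threads))


def only_a_supports(raw, threads):
    # A works in threads, then hands everything to B at once; B is serial.
    handed_over = flatten(parallel(read, raw, threads))
    return convert(handed_over)


def only_b_supports(raw, threads):
    # A reads everything serially; B divides the data and uses threads.
    all_rows = read(raw)
    return flatten(parallel(convert, all_rows, threads))
```

All three functions return the same result for the same input. They differ only in which stage runs serially, which is where the bottleneck described in the overview appears.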
Components that support Multi-Thread Processing
The components that support Multi-Thread Processing are as follows:
- Read CSV File
  The data to be read is repeatedly divided in half until each piece reaches a certain size, and the pieces are then processed in parallel using multiple threads. (Intermediate data may be output to temporary files depending on the data size and the number of CPU cores.)
  In the default setting, Multi-Thread Processing is disabled.
  If the Enable multi-thread processing setting is selected, Multi-Thread Processing is performed provided that the component that receives the result data supports Multi-Thread Processing.
- Join
  The input data that is handed over is processed in parallel using multiple threads. The intermediate data for the Join operation is output to temporary files and then joined.
  The joined result is divided into appropriate key groups and output.
- Aggregate
  The input data that is handed over is processed and aggregated in parallel using multiple threads. (Intermediate data may be output to temporary files depending on the data size.)
  The aggregated result is divided into appropriate groups and output. If the group key isn't set, the result isn't divided.
- Sort
  The input data that is handed over is processed and sorted in parallel using multiple threads. (Intermediate data may be output to temporary files depending on the data size and the number of CPU cores.)
  To guarantee the order, the sorted result is output without being divided.
- Write CSV File
  The input data that is handed over is processed in parallel using multiple threads. The order of the result data isn't guaranteed.
  Multi-Thread Processing is performed if the input source component is one of the following:
  - Read CSV File with the Enable multi-thread processing setting selected
  - Join
  - Aggregate
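The halving described for Read CSV File can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `halve_ranges` and the `max_piece` threshold are hypothetical, the actual threshold used by the product is not documented here, and the sketch ignores the need to align each split point to a CSV record boundary.

```python
def halve_ranges(start, end, max_piece):
    """Split the byte range [start, end) in half repeatedly.

    Halving stops for a piece once it is at most `max_piece` bytes long.
    Returns the resulting (start, end) ranges in order.
    """
    if max_piece <= 0:
        raise ValueError("max_piece must be positive")
    if end - start <= max_piece:
        return [(start, end)]
    mid = start + (end - start) // 2
    # Recurse into both halves; each half is strictly shorter, so this ends.
    return halve_ranges(start, mid, max_piece) + halve_ranges(mid, end, max_piece)
```

Each returned range could then be handed to a separate thread. For example, `halve_ranges(0, 1000, 300)` yields four ranges of 250 bytes each, because 500-byte halves still exceed the assumed threshold.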
Specification limitations
- If the PSP data flow is enabled, Multi-Thread Processing isn't performed.
- In the execution history view, the processing time of the input source component of a component that performs Multi-Thread Processing may be displayed as very short.
- Errors that occur in the input source of a component that performs Multi-Thread Processing may be output by the component that performs Multi-Thread Processing.
- Errors generated by Multi-Thread Processing may be output by the component that receives the Multi-Thread Processing result data.
- The DataAlreadyUsedException error occurs in the following cases:
  - The result data of Read CSV File with the Enable multi-thread processing setting selected is used by multiple components that perform Multi-Thread Processing
  - The result data of the Join operation, the Aggregate operation with a group key set, or the Sort operation is used by multiple components
- An error occurs during execution when the result data of the Join operation, the Aggregate operation with a group key set, or the Sort operation is used with the following components or under the following conditions:
  - In the Loop, Conditional Loop, or Loop by Number of Data operation, the data flow is drawn directly from a component that performs Multi-Thread Processing outside the loop to a component within the loop
  - Variable Mapper
  - The result data is assigned to a script variable in the Document Mapper
  - Merge Mapper
  - Sort by Key logic
  - Sort by Two Key logic
- In the Loop, Conditional Loop, or Loop by Number of Data operation, the DataAlreadyUsedException error occurs during execution when the data flow is drawn directly from a Read CSV File operation outside the loop that has the Enable multi-thread processing setting selected to a component within the loop that performs Multi-Thread Processing.
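The internal cause of the DataAlreadyUsedException error is not described here. One reading consistent with its name is that multi-threaded result data can be consumed only once, so a second consumer (another component, or a later loop iteration) finds it already used. The following Python sketch models that assumption. `DataAlreadyUsedError` and `SingleUseResult` are hypothetical names for this example and are not product classes.

```python
class DataAlreadyUsedError(RuntimeError):
    """Hypothetical Python stand-in for DataAlreadyUsedException."""


class SingleUseResult:
    """Result data that may be handed to exactly one consumer (assumed model)."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._used = False

    def consume(self):
        # A second call models a second component, or a second loop
        # iteration, trying to read the same result data.
        if self._used:
            raise DataAlreadyUsedError("result data was already consumed")
        self._used = True
        chunks, self._chunks = self._chunks, None
        return chunks
```

Under this model, connecting one result to two consumers, or reading it again on each pass of a loop, fails on the second `consume()` call. That mirrors the cases listed above.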