Multi-thread Processing

What is Multi-thread Processing

Multi-thread processing achieves high performance by partitioning the data and processing the partitions in parallel on multiple threads.
The number of threads is determined automatically from conditions such as the size of the data being read and the number of CPU cores.
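
The exact rule for choosing the number of threads is not described here. As a rough illustration only, the sketch below shows one way a thread count could be derived from data size and CPU cores. The function name `choose_thread_count`, the parameter `min_bytes_per_thread`, and its 64 MiB default are hypothetical and are not taken from the product.

```python
import os


def choose_thread_count(data_size_bytes, min_bytes_per_thread=64 * 1024 * 1024,
                        cpu_count=None):
    """Hypothetical heuristic: use no more threads than CPU cores,
    and no more threads than the amount of data justifies."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Ceiling division: how many chunks of min_bytes_per_thread cover the data.
    by_size = -(-data_size_bytes // min_bytes_per_thread)
    # Always use at least one thread, even for empty input.
    return max(1, min(cores, by_size))
```

Under these assumptions, a small file is processed in a single thread regardless of how many cores are available, while a large file is capped at the core count.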

To perform multi-thread processing, the components involved must support it.

If a component that performs multi-thread processing has an input source or output destination component that does not support it, the handover of input data or result data is performed sequentially even though processing inside the component is parallel. As a result, the handover becomes a performance bottleneck.
Specifying components that support multi-thread processing as the input source and output destination allows the multi-thread processing features to be used to the full, which yields the maximum performance.
That is, multi-thread processing is most effective when every component in a series of operations, such as read-convert-write, supports it.

Architecture of Multi-thread Processing

Process Flow

Multi-thread processing is executed according to the flow below.
(Hereinafter, the component that passes on its result data is referred to as "Component A", and the component that receives the result data from Component A is referred to as "Component B".)

When both Component A and Component B support multi-thread processing:

  1. Component A: Partitions the data to be read depending on its size or the number of CPU cores.
  2. Component B: Reads the partitions produced by Component A in multiple threads, generating result data in each thread.
  3. Component B: Collects the result data from the threads and outputs it as a whole.
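
The three steps above can be sketched in Python as follows. This is an illustration of the flow only, not the product's implementation: the function names, the fixed `partition_count`, and the use of upper-casing as a stand-in for real conversion work are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def component_a_partition(rows, partition_count):
    """Step 1: split the rows into at most partition_count contiguous partitions."""
    size = max(1, -(-len(rows) // partition_count))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def component_b_convert(partition):
    """Step 2: per-partition work (upper-casing stands in for real conversion)."""
    return [value.upper() for value in partition]


def run_pipeline(rows, partition_count=4):
    partitions = component_a_partition(rows, partition_count)
    with ThreadPoolExecutor(max_workers=partition_count) as pool:
        # Each partition is handed straight to a worker thread in Component B.
        results = list(pool.map(component_b_convert, partitions))
    # Step 3: collect the per-thread results and output them as a whole.
    return [row for part in results for row in part]
```

For example, `run_pipeline(["a", "b", "c", "d", "e"], partition_count=2)` splits the rows into two partitions and returns the converted rows as one list. Note that `pool.map` keeps the partitions in order in this sketch; as stated elsewhere in this document, some components do not assure the order of their result data.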

When only Component A supports multi-thread processing:

  1. Component A: Partitions the data to be read depending on its size or the number of CPU cores.
  2. Component A: Collects the result data of multiple threads and passes it over to component B as a whole.
  3. Component B: Receives the data from component A and generates the result data.
  4. Component B: Outputs the result data.
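
A minimal sketch of this case follows, again as an illustration rather than the product's implementation. The function names and sample data are hypothetical, and the naive `split(",")` parsing ignores quoted fields. The point to notice is that Component A must merge its per-thread results into one whole before Component B can start.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_partition(lines):
    """Per-partition work inside Component A: split each line into fields."""
    return [line.split(",") for line in lines]


def component_a(partitions):
    """Steps 1-2: process partitions in parallel, then merge into one result."""
    with ThreadPoolExecutor() as pool:
        parsed = list(pool.map(parse_partition, partitions))
    # The merged list is the single, sequential handover to Component B.
    return [row for part in parsed for row in part]


def component_b(rows):
    """Steps 3-4: a single-threaded consumer that sees only the merged whole."""
    return [{"name": name, "amount": int(amount)} for name, amount in rows]


partitions = [["a,1", "b,2"], ["c,3"]]  # hypothetical, already-partitioned input
records = component_b(component_a(partitions))
```

Because `component_b` runs in one thread over the merged list, the parallel work in `component_a` does not carry over to the next stage, which is the handover bottleneck described earlier.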

When only Component B supports multi-thread processing:

  1. Component A: Reads the entire data.
  2. Component A: Passes over the read data to component B.
  3. Component B: Receives the data from component A, partitions the data depending on its size or the number of CPU cores, and generates the result data in multiple threads.
  4. Component B: Collects the result data of multiple threads and outputs it as a whole.
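
This case can be sketched as the mirror image of the previous one: the reader is sequential, and partitioning happens only after the handover. The function names, the fixed `partition_count`, and the use of summation as the stand-in workload are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def component_a_read_all(text):
    """Steps 1-2: a single-threaded reader that hands over all values at once."""
    return [int(line) for line in text.splitlines() if line]


def component_b(values, partition_count=4):
    """Steps 3-4: partition on receipt, process in threads, combine the results."""
    size = max(1, -(-len(values) // partition_count))  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=partition_count) as pool:
        partial_sums = list(pool.map(sum, partitions))
    # Collect the per-thread results into a single output.
    return sum(partial_sums)


total = component_b(component_a_read_all("1\n2\n3\n4\n5\n"))
```

Here Component B cannot begin until Component A has read the entire input, so the read itself gains nothing from the threads used afterwards.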

Components Supporting Multi-thread Processing

The components supporting multi-thread processing are as follows.
Name: Read CSV File
  Process overview: Partitions the data into halves until each partition is smaller than a certain size, and processes the partitions in parallel on multiple threads. (Intermediate data might be output to temporary files depending on the size of the data or the number of CPU cores.)
  Remarks:
    • By default, multi-thread processing is disabled.
      Multi-thread processing is performed only when [Enable multi-thread processing setting] is checked and the component receiving the result data supports multi-thread processing.

Name: Join
  Process overview: Processes the received input data in parallel on multiple threads, outputs the required intermediate data to temporary files, and then joins the data. The joined result is output in units of appropriately partitioned key groups.
  Remarks: (none)

Name: Aggregate
  Process overview: Processes and aggregates the received input data in parallel on multiple threads. (Intermediate data might be output to temporary files depending on the size of the data.) The aggregated result is output in units of appropriately partitioned groups. If no group key is specified, partitioning is not performed.
  Remarks: (none)

Name: Sort
  Process overview: Processes and sorts the received input data in parallel on multiple threads. (Intermediate data might be output to temporary files depending on the size of the data or the number of CPU cores.) The sorted result is output without partitioning so that its order is preserved.
  Remarks: (none)

Name: Write CSV File
  Process overview: Processes the received input data in parallel on multiple threads. The order of the result data is not assured.
  Remarks:
    • Multi-thread processing is performed if the input source component is any of the following.
      • Read CSV File with [Enable multi-thread processing setting] checked
      • Join
      • Aggregate
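
The halving strategy described for Read CSV File can be sketched as a recursive split over a byte range. This is an illustration under stated assumptions: the function name `halve_until_small` and the threshold parameter `max_size` are hypothetical, the actual threshold is not documented here, and the sketch stops splitting once a range is no larger than `max_size`. A real CSV reader would also need to move each split point to a record boundary, which is omitted.

```python
def halve_until_small(start, end, max_size):
    """Split the range [start, end) in half repeatedly until every
    resulting range is no larger than max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if end - start <= max_size:
        return [(start, end)]
    mid = (start + end) // 2
    # Recurse into both halves and keep the ranges in their original order.
    return (halve_until_small(start, mid, max_size)
            + halve_until_small(mid, end, max_size))
```

Each resulting range could then be given to a separate worker thread. Because the split always halves, the number of partitions under this scheme is a power of two whenever the input exceeds the threshold.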

Specification Limits