Optimized data splitting


We would like to optimize our usage of the system.

1.  When splitting the data stream, what are the optimal settings?

2.  Should we determine the number of records to split based on the size of the files being processed?

3.  Is there a maximum number of concurrent processors we can use during parallel processing?

4.  What would be the downsides of setting the maximum concurrent processors to 10, 20 or 50?



Splitting and parallel processing are affected by a number of factors, such as the number of columns and the size of the data per record, CPU configuration, available memory, Java heap size, and the number of other processes running. There are no generic optimized splitting parameters that apply to all mappings. A split size of 500 with 10 concurrent processes may boost performance for one mapping, while the same configuration may degrade performance for another (depending on the size of data per split, memory, etc.). These parameters therefore differ from mapping to mapping.
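To illustrate why the same settings behave differently across mappings, here is a rough back-of-the-envelope sketch. It assumes that the record data held in memory at one time is approximately split size × average record size × number of concurrent processes. That formula, the function name, and the record sizes are illustrative assumptions, not figures taken from the product; actual memory usage will be higher because of object overhead and intermediate data.

```python
def estimated_in_flight_bytes(split_size, avg_record_bytes, concurrency):
    """Rough estimate (an assumption, not a product formula) of the raw
    record data held in memory when every concurrent process is working
    on one full split."""
    return split_size * avg_record_bytes * concurrency


# Same configuration (split size 500, 10 concurrent processes),
# two hypothetical mappings with very different record sizes.
narrow = estimated_in_flight_bytes(500, 2 * 1024, 10)    # about 2 KB per record
wide = estimated_in_flight_bytes(500, 200 * 1024, 10)    # about 200 KB per record

print(f"narrow records: {narrow / 1e6:.1f} MB in flight")
print(f"wide records:   {wide / 1e6:.1f} MB in flight")
```

With these hypothetical record sizes, the identical configuration differs by a factor of 100 in the amount of data in flight, which is one reason a setting that helps one mapping can hurt another.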

As a general recommendation for a particular mapping in the development environment, you can start with a split size of 200 and 10 concurrent processes, as long as memory usage stays within 50% of the allotted Java heap size and other concurrent processes are not consuming memory. To optimize these numbers, take all the impacting parameters into consideration and tweak the numbers one at a time (increasing concurrent processes first, then split size), checking the results after each change. We have seen that a lower split count combined with higher concurrency typically improves performance.
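The tuning procedure described above could be sketched roughly as follows. This is only an outline under stated assumptions: `measure` is a hypothetical callback that you would write yourself to run the mapping once with the given settings and report the elapsed time and peak heap usage; the step sizes and the stop-at-first-regression rule are also illustrative choices, not product behavior.

```python
def tune(measure, heap_bytes, start_split=200, start_concurrency=10,
         concurrency_step=5, split_step=100, max_steps=20, heap_fraction=0.5):
    """Tune one parameter at a time: concurrency first, then split size.

    measure(split_size, concurrency) is a hypothetical user-supplied
    callback returning (elapsed_seconds, peak_heap_bytes) for one run.
    A candidate is accepted only if it is faster than the best run so far
    and its peak heap stays within heap_fraction of heap_bytes.
    """
    budget = heap_bytes * heap_fraction
    split, concurrency = start_split, start_concurrency

    best_elapsed, peak = measure(split, concurrency)
    if peak > budget:
        raise ValueError("starting settings already exceed the heap budget")

    # Phase 1: increase the number of concurrent processes.
    for _ in range(max_steps):
        candidate = concurrency + concurrency_step
        elapsed, peak = measure(split, candidate)
        if peak > budget or elapsed >= best_elapsed:
            break  # over budget or no longer improving
        concurrency, best_elapsed = candidate, elapsed

    # Phase 2: increase the split size, keeping the chosen concurrency.
    for _ in range(max_steps):
        candidate = split + split_step
        elapsed, peak = measure(candidate, concurrency)
        if peak > budget or elapsed >= best_elapsed:
            break
        split, best_elapsed = candidate, elapsed

    return split, concurrency, best_elapsed
```

In practice each call to `measure` would be a full test run of the mapping in the development environment, ideally while no other memory-heavy processes are running, so that the measurements are comparable.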
