This is low-hanging fruit, but I keep seeing posts where folks ask: what is the difference between the 'Unique rows' and 'Unique rows (HashSet)' steps in Pentaho Data Integration (Kettle)?
Both steps are used to keep only the unique rows from the step input. Spoon's samples folder contains some good examples with test data showing how these steps work. The examples are under
%PENTAHO_INSTALLATION_FOLDER%\samples\transformations\
and the files are:
Unique Rows by Hashset - basic example.ktr
Unique - Duplicate.ktr
Unique - Case insensitive unique.ktr
So which of the 'Unique' steps should you use? Both provide the same result but behave differently under the hood.
'Unique rows'
To get only unique output rows, the input has to be presorted. This is very important; otherwise the step will behave incorrectly. Input rows must be sorted by the keys that will be used to detect duplicate rows. Why does this strict precondition exist?
Let's take a look at the algorithm this step implements. When the transformation starts, the step waits for the first row on its input. When the first row arrives (in the example above it comes from the 'Sort rows' step), the 'Unique rows' step inspects it for the keys defined as 'Fields to compare on'. If these field names are not found in the input, the step fails immediately and the whole transformation fails with it. This happens at run-time, not during the 'preparing transformation' phase. If your input is something like a Table input step, this means the database connection will be established and data will be fetched, and only when the first row enters 'Unique rows' will everything fall down. Remember this if the previous steps perform some heavyweight operations.
Row data in Kettle is an array of objects. Using the field names, the step tries to find the array indexes of the 'fields to compare'. For all subsequent rows these fields are extracted by index, not by name, so the row structure must stay the same for 'Unique rows'. As you may remember, Kettle shows a warning if a user tries to mix rows with different structures.
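To make that concrete, here is a minimal Java sketch of the name-to-index idea (the class and method names are mine for illustration, not Kettle's actual RowMetaInterface API):

```java
import java.util.Arrays;

public class FieldIndexLookup {
    /** Resolve each compare field's position in the row layout once, up front. */
    static int[] resolveIndexes(String[] fieldNames, String[] compareFields) {
        int[] indexes = new int[compareFields.length];
        for (int i = 0; i < compareFields.length; i++) {
            indexes[i] = Arrays.asList(fieldNames).indexOf(compareFields[i]);
            if (indexes[i] < 0) {
                // This is the point where Kettle fails the step (and the
                // whole transformation) at run-time.
                throw new IllegalStateException("Field not found: " + compareFields[i]);
            }
        }
        return indexes;
    }

    /** Extract the key values from a row (an Object[]) by precomputed index. */
    static Object[] extractKey(Object[] row, int[] indexes) {
        Object[] key = new Object[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            key[i] = row[indexes[i]];
        }
        return key;
    }
}
```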
If no key fields are defined for the 'Unique rows' step, the whole row is used for comparison (that is, all fields are compared one by one). So there are two possible options: we determine unique rows either by all the data in the row or only by some of its fields.
Since we have the first row at the beginning of the transformation and some 'compare to' fields defined, the step copies just this one current row into its internal cache and outputs it, since the first row is always unique.
When the second row arrives at the input, the step compares its data with the previous row stored in the cache. Depending on the step settings, this comparison is performed on the key fields or on the whole row (described above). The comparison is delegated to the metadata object: for String data Java's native compareTo() is called, but in general it depends on how the metadata implements compareTo().
So we have two cases: the current and previous rows are the same (not unique), or they differ (unique). Remember, we always compare to the previous row. If the rows differ, the current row is copied into the step's internal cache, overwriting the stored data, and is also passed to the output. Otherwise the non-unique row is discarded, or, if error handling is enabled, it is sent to the error stream.
The advantage of this step is that only one row is ever stored in the internal cache during execution. The step does not block the row stream; it outputs each row as soon as it proves unique.
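Here is a minimal Java sketch of that logic (an illustration of the idea, not Kettle's actual UniqueRows code; real Kettle delegates the comparison to the row metadata's compareTo(), while this sketch just checks key equality):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UniqueConsecutive {
    /** Emit a row whenever its key differs from the previous row's key. */
    static List<Object[]> uniqueConsecutive(List<Object[]> input, int[] keyIndexes) {
        List<Object[]> output = new ArrayList<>();
        Object[] previousKey = null;                  // the one-row internal "cache"
        for (Object[] row : input) {
            Object[] key = new Object[keyIndexes.length];
            for (int i = 0; i < keyIndexes.length; i++) {
                key[i] = row[keyIndexes[i]];
            }
            // The first row is always unique; later rows only if the key changed.
            if (previousKey == null || !Arrays.deepEquals(previousKey, key)) {
                output.add(row);                      // unique: pass to the output
                previousKey = key;                    // overwrite the cache
            }
            // else: the duplicate is dropped (or goes to the error stream in Kettle)
        }
        return output;
    }
}
```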
Imagine a case where we have rows like:

   before sorting    after sorting
1)     A 1               A 1
2)     A 3               A 1
3)     A 2               A 2
4)     A 1               A 3
5)     A 3               A 3
In the first (unsorted) case 'Unique rows' returns all 5 rows, since
1) {A 3} is unique compared to the previous {A 1}
2) {A 2} is unique compared to the previous {A 3}
3) {A 1} is unique compared to the previous {A 2} ...
For the sorted input there will be 3 unique rows:
1) {A 1} is not unique compared to {A 1}
2) {A 2} is unique compared to {A 1}
3) {A 3} is unique compared to {A 2}
4) {A 3} is not unique compared to {A 3}
If the comparison were done only on the field containing the letter A, both cases would produce just one unique row: only the first row would be passed to the output, and all the others would be skipped or sent to the error stream.
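Feeding the example data through the uniqueConsecutive() sketch above reproduces those counts:

```java
// Uses the uniqueConsecutive() helper from the sketch above.
List<Object[]> unsorted = Arrays.asList(
        new Object[]{"A", 1}, new Object[]{"A", 3}, new Object[]{"A", 2},
        new Object[]{"A", 1}, new Object[]{"A", 3});
List<Object[]> sorted = Arrays.asList(
        new Object[]{"A", 1}, new Object[]{"A", 1}, new Object[]{"A", 2},
        new Object[]{"A", 3}, new Object[]{"A", 3});

// Compare on both fields: all 5 rows survive unsorted, only 3 survive sorted.
System.out.println(uniqueConsecutive(unsorted, new int[]{0, 1}).size()); // 5
System.out.println(uniqueConsecutive(sorted, new int[]{0, 1}).size());   // 3
// Compare only on the letter field: a single row survives in both cases.
System.out.println(uniqueConsecutive(unsorted, new int[]{0}).size());    // 1
```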
'Unique rows (HashSet)'
This step does not require input rows to be presorted, since it uses Java's HashSet implementation. The step reads the input rows and collects them in an internal HashSet; when there are no more input rows, it simply outputs the rows from the hash set.
Pay attention:
- it does not guarantee that the order of rows will remain constant over time
- it blocks the flow until all input has been read
- it will consume an amount of RAM roughly equal to the size of all the rows in memory
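A minimal sketch of that idea in plain Java, showing where the blocking and the RAM consumption come from (an illustration, not the step's real code; Object[] rows are wrapped as Lists so that the HashSet compares them by value rather than by reference):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueHashSet {
    static List<List<Object>> uniqueByHashSet(List<List<Object>> input) {
        // The whole input is consumed before anything is emitted, and the set
        // holds ALL distinct rows in RAM at once.
        Set<List<Object>> seen = new HashSet<>(input);
        return new ArrayList<>(seen);   // output order is not guaranteed
    }
}
```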
Which step to use: 'Sort rows' + 'Unique rows', or 'Unique rows (HashSet)'? I would say it depends on how much RAM the machine has. If the data to process is bigger than the available RAM, Kettle will run into an OutOfMemory error with 'Unique rows (HashSet)', since the hash set grows until it holds all the data.
Kettle's 'Sort rows' step implements a merge sort that can store data on the hard drive in a temp folder, so the combination of 'Sort rows' + 'Unique rows' gives the ability to process data bigger than the RAM amount, with overall performance roughly equal to that of the sort step.
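As a rough illustration of that combination (in-memory only; Kettle's 'Sort rows' additionally spills sorted chunks to temp files and merges them, which this sketch does not reproduce):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortThenUnique {
    static List<List<Object>> sortThenUnique(List<List<Object>> input,
                                             Comparator<List<Object>> keyOrder) {
        List<List<Object>> rows = new ArrayList<>(input);
        rows.sort(keyOrder);                         // 'Sort rows'
        List<List<Object>> output = new ArrayList<>();
        List<Object> previous = null;                // one-row cache, as above
        for (List<Object> row : rows) {
            if (previous == null || keyOrder.compare(previous, row) != 0) {
                output.add(row);                     // 'Unique rows'
                previous = row;
            }
        }
        return output;
    }
}
```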
But always pay attention to the rules by which rows are expected to be unique.
Have unique rows only, and successful Kettle transformations!