Thursday, April 16, 2015

The difference between 'Unique rows' and 'Unique rows (HashSet)'

This is low-hanging fruit, but I have come across posts where folks asked what the difference is between the 'Unique rows' and 'Unique rows (HashSet)' steps in Pentaho Data Integration (Kettle).

Both steps are used to keep only the unique rows from the step input. Spoon's examples folder contains some good examples with test data showing how these steps work. The examples are under


The files are:

Unique Rows by Hashset - basic example.ktr
Unique - Duplicate.ktr
Unique - Case insensitive unique.ktr

So which of the 'Unique' steps should you use? They produce the same result but behave differently under the hood.

'Unique rows'

To get only unique output rows, the input has to be presorted. This is very important; otherwise the step will behave incorrectly. Input rows must be sorted by the key fields that will be used to detect duplicate rows. Why does this strict precondition exist?

Let's take a look at the algorithm this step implements. When the transformation starts, the step waits for its first input row. When the first row arrives (in the example above it comes from the 'Sort rows' step), 'Unique rows' inspects it for the keys listed under 'Fields to compare on'. If these field names are not found in the input, the step fails immediately and the whole transformation fails with it. This happens at run time, not during the 'preparing transformation' phase. If your input is something like a 'Table input' step, that means the database connection will be established and data will be fetched, and only when the first row enters 'Unique rows' does everything fall down. Remember this if the previous steps perform expensive operations.
Row data in Kettle is an array of objects. The step looks up the array indexes of the 'fields to compare' by field name; for all subsequent rows these fields are extracted by index, not by name. So the row structure for 'Unique rows' must stay the same. As you may remember, Kettle shows a warning if the user tries to mix rows with different structures.
If no key fields are defined for the 'Unique rows' step, the whole row is used for the comparison (that means all fields are compared one by one). So there are two possible options: we detect unique rows either by all the data in the row, or only by some fields.
Since we have the first row at the beginning of the transformation and some 'compare to' fields defined, the step copies just this one current row into its internal cache and outputs it, because the first row is always unique.
When the second row arrives, the step compares it with the previous row stored in the cache. Depending on the step settings, this comparison is performed on some key fields or on the whole row (described above). The comparison is delegated to the metadata object: for String data Java's native compareTo() is called, but in general it depends on how the metadata implements compareTo().
So there are two cases: the current and previous rows are the same (not unique), or they are different (unique). Remember, we always compare with the previous row. If the rows differ, the current row is copied into the internal cache again, overwriting the stored data, and is also passed to the output. Otherwise the non-unique row is discarded, or, if error handling is enabled, it is placed on the error stream.
The advantage of this step is that during execution only one row is ever stored in the internal cache. The step does not block the stream; it outputs rows as soon as they are found to be unique.
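A minimal sketch of this compare-to-previous logic (not the actual Kettle source; the field-name-to-index resolution is omitted and the key indexes are passed in directly):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Simplified sketch of the 'Unique rows' algorithm: cache only the previous
// row and emit a row whenever it differs from that cached one.
public class UniqueRowsSketch {
    // keyIndexes would be resolved once from the 'Fields to compare on' names.
    static List<Object[]> uniqueRows(List<Object[]> sortedInput, int[] keyIndexes) {
        List<Object[]> out = new ArrayList<>();
        Object[] previous = null;                  // the single cached row
        for (Object[] row : sortedInput) {
            if (previous == null || !sameKeys(previous, row, keyIndexes)) {
                out.add(row);                      // unique: pass to output
                previous = row;                    // overwrite the cache
            }
            // else: duplicate, dropped (or sent to the error stream in Kettle)
        }
        return out;
    }

    static boolean sameKeys(Object[] a, Object[] b, int[] keyIndexes) {
        for (int i : keyIndexes) {
            if (!Objects.equals(a[i], b[i])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // the sorted example from this post: A1, A1, A2, A3, A3
        List<Object[]> input = Arrays.asList(
            new Object[]{"A", 1}, new Object[]{"A", 1}, new Object[]{"A", 2},
            new Object[]{"A", 3}, new Object[]{"A", 3});
        System.out.println(uniqueRows(input, new int[]{0, 1}).size()); // prints 3
    }
}
```

Note how the cache never holds more than one row; this is exactly why presorted input is essential, since a duplicate that is not adjacent to its twin will slip through.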

Imagine a case where we have rows like:
    before sorting     after sorting
1)    A 1                     A 1
2)    A 3                     A 1
3)    A 2                     A 2
4)    A 1                     A 3
5)    A 3                     A 3

In the first case 'Unique rows' returns all 5 rows, since
1) {A 3} is unique compared to {A 1}
2) {A 2} is unique compared to {A 3}
3) {A 1} is unique compared to {A 2} ...

For the sorted input there are 3 unique rows:
1) {A 1} is not unique compared to {A 1}
2) {A 2} is unique compared to {A 1}
3) {A 3} is unique compared to {A 2}
4) {A 3} is not unique compared to {A 3}

If the comparison used only the field containing the letter A, there would be only one unique row in both cases: only the first row would be passed to the output, and all the others would be skipped or sent to the error stream.

'Unique rows (HashSet)'

This step does not require presorted input, because it uses Java's HashSet implementation. The step reads input rows and collects them in an internal HashSet; when there are no more input rows, it outputs the rows from the hash set to the step output.
Pay attention:
  • it does not guarantee that the order of rows stays constant
  • it blocks the flow until all input has been read
  • it consumes an amount of RAM roughly equal to the size of all rows in memory
Generally speaking, this step is suitable only for small amounts of data. Hashing makes the uniqueness check itself fast, but the step gives no overall performance advantage because of the blocking behavior and the memory footprint.
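The idea can be sketched with a plain Java set (again a simplification, not the actual step code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the 'Unique rows (HashSet)' idea: every row goes into
// a hash set, so no presorting is needed, but all rows sit in memory and
// nothing is emitted until the whole input has been consumed.
public class UniqueRowsHashSetSketch {
    // Rows are modeled as List<Object> because List has value-based
    // hashCode()/equals(); a plain Object[] would compare by identity.
    static List<List<Object>> uniqueRows(List<List<Object>> input) {
        Set<List<Object>> seen = new LinkedHashSet<>(); // duplicates collapse here
        seen.addAll(input);            // blocks: consumes the entire input first
        return new ArrayList<>(seen);  // then flushes the distinct rows downstream
    }

    public static void main(String[] args) {
        // unsorted input A1, A3, A1: two distinct rows, no sort step required
        List<List<Object>> rows = Arrays.asList(
            Arrays.asList((Object) "A", 1),
            Arrays.asList((Object) "A", 3),
            Arrays.asList((Object) "A", 1));
        System.out.println(uniqueRows(rows).size()); // prints 2
    }
}
```

The LinkedHashSet here happens to preserve insertion order; the real step makes no such ordering guarantee, which is the first bullet point above.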

Which step to use: 'Sort rows' + 'Unique rows', or 'Unique rows (HashSet)'? I would say it depends on how much RAM the machine has. If the data to process is larger than the available RAM, Kettle will run out of memory with 'Unique rows (HashSet)', since the hash set grows to the size of all the data.
Kettle's 'Sort rows' step implements a merge sort that can spill data to temporary files on disk. So the 'Sort rows' + 'Unique rows' combination makes it possible to process data bigger than RAM, with overall performance roughly equal to that of the sort step.

But always pay attention to the rules by which rows are expected to be unique.

Have unique rows only, and successful Kettle transformations!

Thursday, December 4, 2014

Pentaho MapReduce job entry fails to execute.

One more thing that can save you a lot of time when playing with Pentaho and the quickstart.cloudera image is the correct settings for your Hadoop cluster.

As usual, quickstart.cloudera is a VM image connected to your host computer via VirtualBox in bridged mode. From that point of view, this simulates a situation where two machines sit in one network with different addresses.

What we have is a Spoon job with a Pentaho MapReduce job entry that attempts to run a mapper (and, if necessary, a reducer) transformation. See the screenshot above:

It behaves strangely: after hitting the start button it hangs for about 5 minutes and then crashes with an exception like:

2014/12/04 11:10:36 - Pentaho MapReduce - ERROR (version 5.2-NIGHTLY-125, build 1 from 2014-11-25_18-58-14 by buildguy) : Call From *****/***** to quickstart.cloudera:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2014/12/04 11:10:36 - Pentaho MapReduce - ERROR (version 5.2-NIGHTLY-125, build 1 from 2014-11-25_18-58-14 by buildguy) : java.net.ConnectException: Call From
*****/***** to quickstart.cloudera:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2014/12/04 11:10:36 - Pentaho MapReduce -     at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at java.lang.reflect.Constructor.newInstance(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client.call(Client.java:1413)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client.call(Client.java:1362)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
2014/12/04 11:10:36 - Pentaho MapReduce -     at com.sun.proxy.$Proxy35.getNewApplication(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:172)
2014/12/04 11:10:36 - Pentaho MapReduce -     at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at java.lang.reflect.Method.invoke(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
2014/12/04 11:10:36 - Pentaho MapReduce -     at com.sun.proxy.$Proxy36.getNewApplication(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:134)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:142)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.mapred.ResourceMgrDelegate.getNewJobID(ResourceMgrDelegate.java:175)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.mapred.YARNRunner.getNewJobID(YARNRunner.java:230)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:357)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
2014/12/04 11:10:36 - Pentaho MapReduce -     at java.security.AccessController.doPrivileged(Native Method)
2014/12/04 11:10:36 - Pentaho MapReduce -     at javax.security.auth.Subject.doAs(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.hadoop.shim.cdh51.HadoopShim.submitJob(HadoopShim.java:83)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.hadoop.shim.cdh51.delegating.DelegatingHadoopShim.submitJob(DelegatingHadoopShim.java:112)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.execute(JobEntryHadoopTransJobExecutor.java:850)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.di.job.Job.execute(Job.java:716)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.di.job.Job.execute(Job.java:859)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.di.job.Job.execute(Job.java:859)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.di.job.Job.execute(Job.java:532)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.pentaho.di.job.Job.run(Job.java:424)
2014/12/04 11:10:36 - Pentaho MapReduce - Caused by: java.net.ConnectException: Connection refused: no further information
2014/12/04 11:10:36 - Pentaho MapReduce -     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
2014/12/04 11:10:36 - Pentaho MapReduce -     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
2014/12/04 11:10:36 - Pentaho MapReduce -     at org.apache.hadoop.ipc.Client.call(Client.java:1380)
2014/12/04 11:10:36 - Pentaho MapReduce -     ... 29 more

The solution is very easy: don't try to troubleshoot the network or search for opened/closed ports on the VM. The trick is that by default the Cloudera Resource Manager restricts requests from other hosts. Since we are running in a sandbox and don't care about advanced security right now, just bind the Resource Manager address to a wildcard; Spoon, connecting from the host machine's IP, will then be able to reach the RM.
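For reference, in Cloudera Manager this is the ResourceManager "Bind to Wildcard Address" option. A hedged sketch of the equivalent hand-edited yarn-site.xml fragment (property names vary between Hadoop versions, so verify against your distribution):

```xml
<!-- sketch: let the ResourceManager listen on all interfaces, not only on
     the address clients resolve for quickstart.cloudera (sandbox use only) -->
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
```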

To make sure the MapReduce application is running, check the log output in the Spoon console (or wherever job logs are visible):

As you can see, we can now track progress; and if the application fails for any reason, that will be another investigation.

Happy distributed computations!

Wednesday, November 19, 2014

Pentaho MapReduce: failed on connection exception, quickstart.cloudera:10020 WHY?

One of the greatest features of Pentaho Spoon is the ability to work with a Hadoop cluster. Even if you are not at enterprise level and have no petabyte data center of your own, you can get some experience with the Cloudera sandbox VM. Configuring Pentaho shims is not trivial, and even when everything seems configured well, very strange issues can appear; you may spend hours on some strange behavior looking for a solution, and that solution turns out to be one or two lines in the configs.

Here is one of them. Running the Pentaho MapReduce example, I got a very strange issue:

2014/11/19 11:39:05 - Pentaho MapReduce - ERROR (version, build 1 from 2014-09-30_19-48-28 by buildguy) : java.net.ConnectException: Call From EPBYMINW2910/ to quickstart.cloudera:10020 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2014/11/19 11:39:05 - Pentaho MapReduce - ERROR (version, build 1 from 2014-09-30_19-48-28 by buildguy) : java.io.IOException: java.net.ConnectException: Call From EPBYMINW2910/ to quickstart.cloudera:10020 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:532)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
2014/11/19 11:39:05 - Pentaho MapReduce -     at java.security.AccessController.doPrivileged(Native Method)
2014/11/19 11:39:05 - Pentaho MapReduce -     at javax.security.auth.Subject.doAs(Unknown Source)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:311)
2014/11/19 11:39:05 - Pentaho MapReduce -     at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:609)

Why does it happen? Why?

When I check this job on the server, it seems to have executed successfully, but it is shown as failed in Spoon.

After spending some time on network investigation, the answer is:

Just reduce the logging interval from 60 down to 5 ('Number of seconds between log messages'). See the screenshot below:

Hope this saves you some time.
Happy Hadoop execution with Pentaho!

Friday, September 5, 2014

What are these strange Pentaho issues all about? Yesterday everything was just perfect!

Yesterday everything was just perfect: the DI server scheduled transformations, I was able to create and save transformations to the repository, and so on. But today everything went wrong!

I got exceptions like this:

And the copy-pasted stack trace is:

Unable to save repository element [/home/pentaho/host/Split field to rows - orig.ktr : Split field to rows - basic example]
 at org.eclipse.jface.operation.ModalContext$ModalContextThread.run (ModalContext.java:113)
 at org.pentaho.di.ui.spoon.dialog.SaveProgressDialog$1.run (SaveProgressDialog.java:70)
 at org.pentaho.di.repository.AbstractRepository.save (AbstractRepository.java:126)
 at org.pentaho.di.repository.pur.PurRepository.save (SourceFile:1679)
 at org.pentaho.di.repository.pur.PurRepository.save (SourceFile:1695)
 at org.pentaho.di.repository.pur.PurRepository.saveTrans (SourceFile:1902)
 at org.pentaho.di.repository.pur.PurRepository.saveTrans0 (SourceFile:1874)
 at org.pentaho.di.repository.pur.F.a (SourceFile:612)

So don't panic, it is not a bug. This is just a kind reminder that Pentaho is not fully open source, and even where it is open source, it is not free as in free beer. You are probably using the Enterprise Edition and THE LICENSE PERIOD OF USE HAS ENDED.

Pay attention to 

at org.pentaho.di.repository.pur.F.a (SourceFile:612)

An experienced developer will realize that no one names a class 'F' and a method 'a'. This is definitely some kind of obfuscated code, and if it is your first time hitting problems in obfuscated code, the first thing to check is the licenses. In most cases one has simply expired, and strange exceptions start to rise in odd places ;)

So check this post, obtain and install the licenses, and this exception will probably go away.
If not, that's sad. But if you are having problems with obfuscated code, you are probably an enterprise customer, so feel free to call enterprise customer support ;)

Today is September 5, 2014, and I can use the EE edition up to the end of this year, thank you very much!

Have good Enterprise Edition business intelligence!

Tuesday, September 2, 2014

Mondrian 4 schemas for BI suite 5.1

Mondrian is known as a good olap4j implementation and a ROLAP engine. A schema can be described in the form of an XML file, and that file can be used to create an Analyzer report for the Pentaho BI server (assuming an enterprise license is available).

Here is the link to the GitHub repository with the source code and the available Mondrian versions: mondrian src

Currently there are different branches: 3.0, 3.1 ... 4.0, lagunitas, etc. Also, different BI server versions (4.4, 5.0, 5.1) use different Mondrian artifacts. I don't think this is a big secret, so it is always possible to check which version is used by the current BI server installation. Navigate to


and check the jar file called mondrian-X-X.jar; usually it contains the Mondrian version in its name. Also, since a regular .jar file is a zip archive, it is possible to look inside (copy and unzip) and check the actual version in the VERSION.txt file.

To get started with Mondrian 3.x schemas, a nice guide is available on the official Pentaho site: 3v. schema specification
There is also the book 'Mondrian in Action' available in stores: 4 version guide

The version 3 and version 4 schemas have some differences; one of them, for example, is the deprecation of VirtualCubes. Schema 4 introduces a (new for Mondrian) logical level, so it becomes a little similar to Oracle's star schema concept (physical/logical/business levels). But Mondrian still stays different. I've spent some time with the Oracle OBIEE XML syntax generated by the commonly known Admin Tool, and I did not come to a conclusion about which XML description is best. It would be interesting to compare the Mondrian description approach with the Oracle tools; I would appreciate it if someone could give me a link to some kind of real technical feature comparison (not just one more sales promotion).

It may not be necessary to rewrite a version 3 schema to version 4. But, for example, BI server 5.1 is not able to operate with a version 4 schema, since the Mondrian engine shipped with BI 5.1 is Mondrian 3.7.

So if you have the 'Mondrian in Action' book and a mondrian-4.0.0.jar file on the shelf, and want to see what happens with BI Analyzer running on the Mondrian 4 engine (not for production, I suppose), follow these steps:

1) Replace mondrian jar in pentaho-solutions/system/osgi/bundles/ with version 4 item.

2) Add a new connection by adding properties like these. Adjust as necessary:


3) Restart your BA server and create a new analyzer report. The cubes from the schema should appear in the list when analyzer starts. If the cubes don't appear, you can open up one of these log files. There's probably an error or a problem with your schema.


If the last two logs are not there, you might have to enable them first, then do a quick BA server restart. Open the file pentaho-solutions/system/osgi/log4j.properties and uncomment the part at the bottom for mondrian logs.


Happy Mondrian 4 in action schemas!

Monday, September 1, 2014

Get mondrian in eclipse, up and running

If you are interested in how Mondrian operates under the hood, want to debug some Mondrian-based application, or just want to explore the Mondrian API, this article is for you. We will check out the Mondrian project from git and run it in Eclipse on Windows.

So Eclipse with the EGit and Ivy plugins is required.

1) checkout from git

git clone https://github.com/pentaho/mondrian.git mondrian

2) Import into Eclipse: click Import - Project from Git - select the mondrian folder and click OK, so the project becomes available in your workspace. But it will fail to build, so additional steps are required.

3) Change the Eclipse build path to 'mondrian-2.0/bin/classes', add the Ivy dependency library, and resolve the project.

4) From the command line run


- this will attempt to resolve the project and generate resource files. Mondrian uses some autogenerated resource files, so every time you make changes in Eclipse and want to run something locally, run

ant generate.resources

5) Rename mondrian.properties.template to mondrian.properties and edit the connection properties for the database that will be used as a data source.

6) In Eclipse, run some files from testsrc/main; the result should be green. Make sure the appropriate JDBC driver is available on your Eclipse classpath.

Happy mondrian debugging!

Wednesday, August 27, 2014

Pentaho operations mart PDI install is not working out of the box! WTF?

This is a short, unofficial guide on how to install and configure the operations mart feature for Pentaho Enterprise Edition. There is also an official Pentaho wiki documentation page located here, but some users consistently find it frustrating and are not able to install and configure the operations mart from scratch on the first try. I am one of those guys, which is why I've created this note with my personal experience.

The main idea of the PDI operations mart is that all our jobs and transformations write logs to logging tables. We inspect the whole repository; for every job or transformation we find the database it writes its log to, load the data from that database, and populate a star schema. Then we create reports (Analyzer) and see what happens. To scan the repository we use some jobs and transformations provided to us as part of the EE distribution.

As a prerequisite, check the following:
  1. You should be able to find 'pentaho-operations-mart' as a zip file or an already unzipped folder.
  2. Know where your EE edition of the DI server is installed.
  3. Know which database connections the DI server uses; be able to create/edit JNDI connections.
  4. Be familiar with job/transformation logging capabilities (check here).
  5. Check where your Kettle home folder is located.
  6. Be ready to execute SQL scripts against the target database (pgAdmin or SQL Developer, etc.).
  7. Have Spoon installed, with an EE repository connection available.

OK, let's start the installation. Usually we operate with two main connections for the PDI part: 'live_logging_info' and 'pentaho_operations_mart'. I will call them LL and OP accordingly in the text below; I don't like long names for configurations.

The LL connection is used to store raw logging info for transformations and jobs executed from the DI server (or locally). Logging can be configured on the job or transformation log tab; usually a connection, schema, and table are defined. Spoon can handle automatic creation of the logging tables.

This image shows typical job properties: the Log tab, where logging is not actually configured yet. Logging will still happen, but only as regular console output. We want our logs written to a database so we can analyze them later. So we define logging to relational tables for all jobs and transformations in the repository; then we are able to harvest all these logs and clean up obsolete records.

It would be a big grind to edit jobs and transformations one by one, defining the log connection and log tables for every repository item. So the Pentaho architects introduced a couple of environment variables to define the logging connections, tables, and schema names.

Here is the example of my kettle.properties file:

#log job db connection
#log trans db connection
#log step db connection
#log job entry db connection
#trans perfomance db connection
#channel table
# what about checkpoints?
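The property values above were lost in copy-paste, leaving only my comments. As a hedged reconstruction, Kettle's documented global logging variables look roughly like this (the connection and table names follow this post; the schema name 'pentaho_dilogs' is an assumption, adjust to your setup):

```properties
# sketch: global logging variables in kettle.properties
KETTLE_JOB_LOG_DB=live_logging_info
KETTLE_JOB_LOG_SCHEMA=pentaho_dilogs
KETTLE_JOB_LOG_TABLE=job_logs

KETTLE_TRANS_LOG_DB=live_logging_info
KETTLE_TRANS_LOG_SCHEMA=pentaho_dilogs
KETTLE_TRANS_LOG_TABLE=trans_logs
```

Analogous KETTLE_STEP_LOG_*, KETTLE_JOBENTRY_LOG_*, KETTLE_TRANS_PERFORMANCE_LOG_* and KETTLE_CHANNEL_LOG_* variables cover the remaining tables mentioned in the comments.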

Based on the example above, we use one connection named 'live_logging_info' (the LL connection we agreed on), the same schema, and different tables. If someone tells you the tables should all have one name, don't believe a stranger's advice; check it yourself. These are different logging tables under one schema.
Also remember: 'job_logs' and 'trans_logs' are the log table names created in Spoon using the 'execute SQL' button.

Some existing distributions provide a predefined SQL script that creates the logging table structures. I would advise not using it at all; instead, create the logging table structure with Spoon. Make sure the tables are created with all columns, i.e. all checkboxes are checked (see image above).

Note that the LL connection does not have to be called 'live_logging_info'. You may even have several logging connections: some items use one connection, some another. When the star logging model is populated, all the connections are visited to load the raw logging data.

It can be useful to have an SQL script with all these DDLs. You can create the tables with Spoon first, and then use some kind of application that can dump the DDL on the fly from the live database.

Here is how my logging tables look in pgAdmin:

Try to run some jobs or transformations and check that logging info starts being collected in these tables, assuming you have the LL connection defined. Remember: the connection we call LL must be accessible to all jobs and transformations. If we define global logging via kettle.properties, every job and transformation gets the prerequisite of establishing this connection; incorrect kettle.properties values or problems with the LL connection can make the ETL fail.

Since we have pgAdmin open, try to find the pentaho_operations_mart schema. This is how it looks on my local machine:

OK, we have the LL connection where the raw logging data is collected, and a mart schema where our logging info will be:
  • E: extracted from the LL database
  • T: transformed into a star structure
  • L: loaded into the star
So be ready to import some jobs and transformations into the repository.
Open Spoon, click 'Tools - Repository - Import repository...' and find the import XML file. It is called 'pdi-operations-mart.xml' and is located somewhere under the '...pdi-operations-mart\etl\' path.

Import this file into the repository structure, in the folder /public/pentaho-operations/mart'
Finally we should see something like this:

Run two jobs:
'Fill in DIM_DATE and DIM_TIME' and then 'Update Dimensions then Logging Datamart'.

The first one just creates the date and time dimensions. The second one inspects the repository for jobs and transformations, extracts information about the logging outputs of every item (remember, we set this with kettle.properties above, but some jobs or transformations can have different log connections), then extracts information about the raw logging data sources, extracts the data, and loads it into the star.

So we can schedule the job to extract this data periodically. We can also schedule the 'Clean up' job, which prevents the operations mart model from growing too big.

Schedule the appropriate jobs and transformations.

As a result, we can expect these transformations to execute on a schedule, collecting logging info and populating the star model. Then we can use this model to create reports.

Common issues when installing the PDI operations mart:
  1. People often mix up the BI op mart and the PDI op mart. The BI op mart collects info about transformations and jobs running on the BI server, while the PDI op mart collects info about items running on the PDI server.
  2. All connections are defined as JNDI connections; that means it is necessary to edit the PDI server's context.xml and web.xml to point to the correct settings.
  3. On the BI server, reports do not show info from the PDI operations mart. Note: the sample reports for PDI use a predefined .xmi model to fire MDX queries, and this .xmi model uses a JNDI connection, so the BI server has to have a JNDI connection to the correct PDI star schema.
  4. One of the most popular questions: "Where are the Oracle and MySQL dialects for the PDI operations mart transformations?" The answer: the Postgres, MySQL, and Oracle versions are the same! The only difference is the JNDI connection type. With plain JNDI we get a connection by interface, but Kettle (like other multi-db engines such as Mondrian) has to know the connection type to handle db-specific SQL query building. In Mondrian the DB type is discovered on the fly; in Spoon we have to define it manually.
  5. So where is it...? Just manually change the connection type from Postgres to Oracle or MySQL: for all transformations and jobs after the export to the EE repository, and only once, in the repository browser, for the connection used for 'pentaho_operations_mart'.
  6. MySQL does not have schemas! Don't worry; just create a database called 'pentaho_operations_mart', consistent with the schema name 'pentaho_operations_mart', so the 'schema.table' SQL used with Oracle or Postgres will not conflict with MySQL's 'db.table'.
  7. Create a separate user for the star schema and for the logging tables. Make sure this user has read/write rights on these tables, and that when a schema name is not specified in a query, the correct one is picked by default. For example, do not use 'hibuser', since it has access to more than one schema on PostgreSQL, and if some queries in a transformation do not specify a schema, the wrong one may be used.
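For issue 2 above, a hedged sketch of what the Tomcat JNDI resource in the server's context.xml typically looks like (the resource name matches the .xmi model's connection; the driver, URL, and credentials here are assumptions for a Postgres setup, adjust to yours):

```xml
<!-- sketch: JNDI datasource for the operations mart star schema -->
<Resource name="jdbc/pentaho_operations_mart" auth="Container"
          type="javax.sql.DataSource"
          factory="org.apache.commons.dbcp.BasicDataSourceFactory"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/hibernate"
          username="pentaho_user" password="password"
          maxActive="20" maxIdle="5" maxWait="10000"/>
```

A matching `<resource-ref>` entry with the same name is what web.xml needs, which is why both files are mentioned in issue 2.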
 Have good, informative PDI operations mart star models!