среда, 27 августа 2014 г.

Pentaho operations mart PDI install is not working out of the box! WTF?

This is short and unofficial guide how-to install and configure operation mart feature for Pentaho Enterprise Edition. There is also official pentaho wiki documentation page located here, but some users constantly found this frustrating and not able to install and config operation mart from scratch on a first try. I am one of this guys, so that is why I've create this note with my personal experience.

The main idea or PDI operations mart is that all our jobs and transformations will write logs to logging table. We will inspect whole repository, for every job or transformation we will find a database it is write log to, we will load data from this database and populate star schema. Then we will create reports (analyzer) and see what happens. To scan repository we will use some jobs and transformations provided for us as a part of EE distribution.

As a prerequisite check the following:
  1. Should be able to find 'pentaho-operations-mart' as a zip file or a folder already unzipped.
  2. Where is your EE edition of DI server installed.
  3. Which database connections DI server is uses. Can create/edit JNDI connections.
  4. Be familiar with jobs/transformations logging capabilities (check here).
  5. Check where is your kettle .home folder is located.
  6. Be ready to execute sql scripts against target database (pgAdmin or SQLDeveloper etc.)
  7. Spoon installed, with EE repository connection available.

Ok, lets start installation. Usually we will operate with 2 main connections: 'live_logging_info' and 'pentaho_operations_mart' for PDI part. I will call them LL and OP accordingly in a text below. I don't like long names for configurations.

LL connection is used to store raw logging info for transformations and jobs executed from DI server (or locally). Logging can be configured with job or transformation tab. Usually it is connection, schema and table defined. Spoon can handle automatic logging tables creation.

This image shows typical job properties - log tab where is actually logging is not configured yet. We still be using logging - but this will be regular console output. We want to have our logs written in database to have ability to analyze them later. So we can define logging to a relational tables for all jobs and transformations in repository - so we will be  able to harvest all this logs and clean up obsolete records.

That would be a big grind work to edit jobs and transformations one by one to define log connection and log tables for every repository item. So pentaho architectures introduced a couple of environment variables to define logging connections , tables, and schema names.

Here is the example of my kettle.properties file:

#log job db connection
#log trans db connection
#log step db connection
#log job entry db connection
#trans perfomance db connection
#channel table
# what about checkpoints?

Based on example provided above we use one connections named 'live_logging_info' - we agreed to call this connection  LL, same schema and different tables. If some one say to you that tables should have one name - don't believe stranger's advices, check it yourself. This is different logging tables under one schema.
Also remember - 'job_logs', 'trans_logs' - this is log tables names created in Spoon using execute sql button.

Some existed distributions provide a predefined sql script to create logging table structures. I would aware not to use it at all - but create logging tables structure using Spoon. Be sure tables created with all columns - means all check-boxes is checked (see image above).

Note that LL connection is not necessary to be called 'live_logging_info'. Even you may have different logging connections - for example some items will use one connection, some items other. When star logging model will be populated - all connections will be visited to load raw logging data.

It would be useful to have a sql script created all this DDLs. You can do it first create this tables with Spoon, and then use some kind of application that can dump DDL on a fly from real database.

Here is how my logging tables looks in pgAdmin:

Try to run some jobs or transformations and check logging info should start to be collected. In this tables. Sure if you have LL connection defined. Remember - for all jobs and transformations connections named 'LL' should be accessible. If we have define global logging with kettle properties - all jobs and transformations will get prerequisite to establish this connection. Incorrect kettle.properties values or problems with LL connection can fail ETL.

Since we have opened pgAdmin - try to find pentaho_operations_mart schema. This is how it is looks on my local machine:

Ok, we have LL connection where is logging info is collected (raw logging data), and a mart schema where is our logging info will be :
  • E 'extracted from LL database' 
  • T transformed to a star structure
  • L loaded into star 
So be ready to import some jobs and transform into repository.
Open Spoon, click 'Tools - Repository - Import repository...', find import xml file. It is called 'pdi-operations-mart.xml' and located somewhere under '...pdi-operations-mart\etl\' path.

Import this file in repository structure in folder /public/pentaho-operations/mart'
Finally we should be able to see somthing like that:

Run two jobs:
'Fill in DIM_DATE and DIM_TIME' and then 'Update Dimensions then Logging Datamart'.

First one just creates time dimensions and data dimensions. Second one - inspects repository for  jobs and transformations, extracts information about logging outputs for every item (remember - we did set this with kettle.propreties above, but some jobs or transformations can have different log connections), then extracts information about raw logging data sources, extract it and load into star.

So we can schedule job to extract this data on some periods of time. We can also schedule  'Clean up' job which will prevent operation mart model from getting very big.

Schedule appropriate jobs and transformation.

In result we may expect that on some time period this transformations will be executed collecting logging info and populating star model. Then we can use this model to create some reports.

Common issues installing PDI data mart is:
  1. Often some people muss with BI op mart and PDI op mart. BI op mart collects info about transformations and jobs running on BI server, while PDI op mart collects info about items running on PDI server.
  2. All connection is defined as JNDI connections - that means it is necessary to edit PDI server context.xml and web.xml to point to correct settings.
  3. On BI server reports does not show info from PDI operations mart. Note - sample reports for PDI uses predefined .xmi model to fire mdx queries - and this xmi model uses JNDI connection - so BI server have to have JNDI connection to correct PDI star schema
  4. One of most popular "Where is Oracle and MySql dialects for PDI operations mart transformations" - the answer is: Postgres, MySql and Oracle is the same! the only difference is  JNDI connection type. If you familiar with JNDI - we will get connection by interface - but for Kettle (and for other multi-db engines like mondrian) - we have to know connection type to handle db specific SQL query building. For example in mondrian DB type is discovered on a fly - for Spoon - we have to define it manually.
  5. So where is... - Just edit manually for all transformations and jobs connection type from Postgres to Oracle or MySql - after export to EE repository and only once - in repository browser - connection used for 'pentaho_operations_mart'.
  6. MySql does not have Schemas! - don't worry. Just create database called 'pentaho_operations_mart' - to be consistent with schema 'pentaho_operations_mart' - so SQL query when using Oracle or Postgres 'schema.table' will not conflict with MySql 'db.table'.
  7. Create separate user for start schema and for logging tables. Make sure this user have RW rights on this tables. Make sure if schema name is not specified - by default appropriate will be used. For example do not use 'hibuser' - since it have access to more then one schema on PostgreSQL - and if some queries in transformation will not specify schema - by default correct one will be used.
 Have a good informative PDI_Operation_Mart star models.

2 комментария:

  1. Thank you for taking the time to provide us with your valuable information. We strive to provide our candidates with excellent care