Pentaho Data Integration (ETL)

Pentaho Data Integration (PDI) is the most complete and mature product in the Pentaho BI tools family and, along with Talend, one of the leaders in the open source ETL market. It has been developed actively since 2003 and over the years has become a product which in many categories competes with commercial ETL software like Informatica PowerCenter, DataStage or Ab Initio. Keeping it open means that bugs get fixed quickly, new features are developed rapidly, and it offers extreme flexibility compared to its competitors.

PDI is open source; however, there is also a commercial paid version available (Pentaho Enterprise Edition). It provides additional tools, versioning, interoperability with enterprise schedulers and extensive customer support, and it is relatively cheap compared to the ETL tools market average: depending on the sales agreement, it averages around 15k USD per year.

Pentaho Data Integration tools

PDI is a platform which contains the following development tools:

  • Spoon - a graphical ETL development tool in which data integration flows are designed (called transformations, which can be chained together as jobs).
  • Pan and Kitchen - command line tools which execute transformations (Pan) and jobs (Kitchen)
  • Carte - a lightweight web server used to execute ETL processes remotely and to implement partitioning and parallel execution.
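
The command-line tools above can be invoked roughly as follows. This is a sketch assuming a local PDI installation; the install path and the `.ktr`/`.kjb` file names and the `TARGET_DB` parameter are hypothetical examples.

```shell
#!/bin/sh
# Hypothetical install location; adjust to your environment.
PDI_HOME="${PDI_HOME:-/opt/pentaho/data-integration}"

# Pan executes a single transformation (.ktr file);
# -level controls logging verbosity (Nothing, Error, Minimal, Basic, Detailed, Debug, Rowlevel).
"$PDI_HOME/pan.sh" -file=/etl/load_customers.ktr -level=Basic

# Kitchen executes a job (.kjb file), optionally passing named parameters.
"$PDI_HOME/kitchen.sh" -file=/etl/nightly_load.kjb -param:TARGET_DB=warehouse

# Carte starts a lightweight web server for remote execution on the given host and port.
"$PDI_HOME/carte.sh" localhost 8081
```

On Windows the equivalent scripts are `Pan.bat`, `Kitchen.bat` and `Carte.bat`. Both Pan and Kitchen return a non-zero exit code on failure, which makes them straightforward to wrap in external schedulers.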

PDI Strengths

  • The open source version of PDI is completely free and very easy to set up, configure and start using right away in a few simple steps
  • Gentle learning curve - anyone with a BI / DW / ETL background can learn and start using Pentaho after a few hours of basic introduction. It has a very intuitive, feature-rich user interface. A good starting point might be the Pentaho ETL tutorial
  • Impressively long list of connectors - it can read from and write to a number of sources and targets, including ODBC, JDBC, real-time sources, queues, Kafka streams, AWS S3 / Redshift, EMR, Cloudera, Salesforce and many more. For example, the ease of setting up a connection to a folder full of Excel files could make other commercial ETL vendors jealous.
  • Support for Python, R, Spark (MLlib)
  • Integrates well with other open source BI solutions (including Big Data tools and realtime analytics)
  • Can be easily extended with Java
  • Scripts and formulas can be written using various programming languages
  • Open source community, forums and lots of web resources
  • Available in AWS Marketplace
  • The PDI Enterprise Edition is relatively cheap (15-20k) compared to other ETL tools like Informatica, DataStage or Ab Initio, which start at around 200k and can go up to a million.

PDI Weaknesses

  • Lack of easy monitoring, error handling and scheduling
  • Large, enterprise-wide deployments might not perform well enough
  • Team development (groupwork) is not well supported and can be difficult
  • It does not integrate well with the rest of the Pentaho BI product family; in fact it's a totally separate tool
  • Up to version 4 it was lightweight and portable (just a few megabytes). Starting with version 5, however, it has grown heavier, exceeding 1 GB, and it takes noticeable time to initialize the runtime environment on execution. It would be nice to still have the option of a powerful data integration tool on a small pendrive without the need to install anything.
  • Pentaho carries the label of an open source platform, so for some big organizations it's difficult to get the green light to start using it (no SLA or support and maintenance agreement for the open source version); it also lacks the marketing power and battlecards of the big commercial ETL providers.

Apache Hop and Pentaho Data Integration

Pentaho Data Integration (PDI) and Apache Hop share a common history, as Apache Hop originated as a fork of PDI. Both projects stem from the same foundational codebase, with PDI initially developed under the Pentaho umbrella, where it became a widely used tool for ETL processes. Over time, as the needs of the data integration community evolved, a group of developers sought to create a more modern, modular, and flexible platform, leading to the creation of Apache Hop. While both tools retain similarities in their core functionalities and concepts due to this shared origin, Apache Hop has diverged to focus on addressing limitations in PDI, such as enhancing modularity, supporting DevOps practices, and improving integration into modern CI/CD workflows. This shared history means that users familiar with PDI will find many concepts and tools in Apache Hop recognizable, even as Hop pushes toward a more contemporary data engineering environment.