A blog from Cloudera.
Apache Sqoop provides a framework to move data between HDFS and relational databases in a parallel fashion using Hadoop’s MR framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources like mainframe datasets to Hadoop. Following are possible reasons for this:
- HDFS is used simply as an archival medium for historical data living on the mainframe. It is cost effective to store data in HDFS.
- Organizations want to move some processing workloads to Hadoop to free up CPU cycles on the mainframe.
Regardless, there was no solution available in Sqoop to allow users to easily move data from mainframe to Hadoop. We at Syncsort wanted to fill that void and started discussions with some Sqoop committers at Cloudera. They were very excited and receptive to the idea of Syncsort making an open source contribution to Sqoop. We decided that this contribution would go into Sqoop version 1.x since it is still widely used. We raised a Sqoop JIRA ticket SQOOP-1272 to work towards that.
Brief Introduction to Mainframe Datasets
Unlike UNIX where a file consists of a sequence of bytes, files on mainframe operating systems contain structured data. The term “datasets” is used to refer to “files.” (Hereafter, we will use “datasets” when we refer to mainframe files.) Datasets have associated access methods and record formats at the OS level. The OS provides APIs to access records from datasets. Most mainframes run an FTP server to provide remote access to datasets. The FTP server accesses records using these APIs and passes them to clients. A set of sequential (an access method) datasets can be stored under something similar to a directory called Partitioned Data Set (PDS) on the mainframe. FTP is not the only way to access datasets on the mainframe remotely. There are proprietary software products like IBM’s Connect:Direct for remote access to datasets on a mainframe.
Once we decided to make a contribution, I started working on the functional specification with the Syncsort engineering team. In terms of functionality, we wanted to start with small, but fundamental steps to accomplishing the objective of mainframe connectivity via Sqoop. Our goals were very modest:
- Connect to the mainframe using open source software.
- Support datasets that can be accessed using sequential access methods.
The first goal was a no-brainer. Sqoop being an Apache project, we decided to use Apache commons-net library for FTP client. The second goal follows the first since mainframe FTP server supports transferring sequential datasets only. On further discussions with the Sqoop team at Cloudera, we decided to support transferring a set of sequential datasets in a partitioned dataset. The transfer will happen in parallel. Each dataset will be stored as a separate HDFS file when the target is HDFS file or Hive/HCatalog. The sequential datasets are transferred to Hadoop in FTP text mode. Each record in the datasets will be treated as one text field in the target.
In essence, a new import tool called import-mainframe would be added to Sqoop. In the --connectoption, user can specify a mainframe host name. In the --dataset option, user can specify the PDS name. For more details, readers can refer to the design document online.
Now that the functionality was defined, I started working with the Syncsort engineering team on the design. I posted an overall design document in the JIRA for feedback from Sqoop committers. We went through a couple of iterations and agreed on it.
At the top level, the design involves implementing a new Sqoop tool class (MainframeImportTool) and a new connection manager class (MainframeManager.) If you dig a little deeper, there are support classes like mainframe specific Mapper implementation (MainframeDatasetImportMapper), InputFormatimplementation (MainframeDatasetInputFormat), InputSplit implementation (MainframeDatasetInputSplit),RecordReader implementation (MainframeDatasetRecordReader), and so on.
Next came the most difficult part, the actual implementation. Members of Syncsort engineering played a vital role in the detailed design and implementation. Since it is impossible to connect to a real mainframe in Apache testing environment, we decided to contribute unit tests based on Java mock objects. It was amazing to discover that Sqoop never used mock objects for testing before! We ran end-to-end tests with real mainframe in-house to verify the correctness of the implementation.
Patch Submission and Review Process
Once we were satisfied with the implementation and testing at Syncsort, I posted a patch to SQOOP-1272 for review. I was advised to post the patch to Sqoop Review Board. I have worked on a handful of Apache Hadoop JIRAs before. Most of the comments and discussions on a patch happen in the JIRA itself. It was a new experience for me to go through the Review Board. After some initial getting-used-to, I liked it very much since it kept most of the noise off of the JIRA. Also, the review comments were easy to address. There was a suggestion to add documentation to this new feature. I started digging through the Sqoop documentation files. I had to learn the documentation format. With some trial and error, I was able to understand it. It took a while to update the documentation. With busy schedules at work and despite Summer vacation interrupting my work on the JIRA, the review process was moving steadily. After a few cycles of review, the code and documentation got better and better. Finally, on Sept. 10, 2014, the patch to the JIRA was committed by Venkat. SQOOP-1272 will go into official Sqoop release 1.4.6.
The design and implementation of SQOOP-1272 allow anyone to extend the functionalities. Users can extendMainframeManager class to provide more complex functionalities while making use of the current implementation.
Syncsort’s flagship Hadoop ETL product DMX-h provides enterprise grade mainframe connectivity with an easy to use graphical user interface. It has been enhanced so that it can be invoked from Sqoop. In particular, it supports the following features:
- Ability to specify complex COBOL copybooks to define mainframe record layouts.
- Ability to transfer a mainframe dataset “as is” in binary form to archive a golden copy in Hadoop.
- Ability to transfer VSAM (Virtual Storage Access Method) datasets