AWS Data Lake for SAP HANA

In 2019, the JDGroup embarked on a discovery phase for building a data lake. A number of decision points needed to be addressed:

  • Could data be extracted from SAP HANA and written to Amazon Simple Storage Service (S3)?

  • Could the transformations currently done in SAS be performed using native AWS tools?

  • What would it cost to extract data from SAP HANA to S3?

  • Would this reduce the pressure on their existing storage?

Work on the project commenced using AWS Glue to perform the extractions from the SAP system. A key aspect throughout was to transfer skills from Cloud Fundis to the JDGroup technical team.
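
To make the extraction approach concrete, the sketch below shows a minimal AWS Glue PySpark job that reads a SAP HANA table over JDBC and lands it as Parquet in S3. The HANA host, port, schema, table, credentials and bucket names are placeholders, and the SAP HANA JDBC driver (ngdbc.jar) is assumed to be supplied to the job, for example via the --extra-jars job parameter.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read one SAP HANA table over JDBC (placeholder connection details).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana.example.internal:30015")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable", "SAPSCHEMA.SALES_DOCS")
    .option("user", "EXTRACT_USER")
    .option("password", "********")  # in practice, fetch from AWS Secrets Manager
    .load()
)

# Land the extract in the raw zone of the data lake as Parquet.
source_df.write.mode("overwrite").parquet("s3://example-datalake/raw/sales_docs/")

job.commit()
```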

We compared AWS Glue with other ways of extracting the data from SAP; specifically, we used Amazon EMR (with Zeppelin notebooks), which also helped transfer skills to the JDGroup team.
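
For comparison, the equivalent extraction in a Zeppelin notebook on Amazon EMR needs only the JDBC read itself, since the notebook already provides a SparkSession. The connection details below are the same placeholders as above, and ngdbc.jar is assumed to be on the cluster's Spark classpath.

```python
%pyspark
# Zeppelin on EMR already exposes the SparkSession as `spark`,
# so no job wrapper is required.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana.example.internal:30015")  # placeholder host/port
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable", "SAPSCHEMA.SALES_DOCS")                # placeholder table
    .option("user", "EXTRACT_USER")
    .option("password", "********")
    .load()
)

source_df.write.mode("overwrite").parquet("s3://example-datalake/raw/sales_docs/")
```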

Key outcomes from this project were:

  • It is possible, and indeed relatively straightforward, to extract data from SAP HANA to S3. At the time, even limited Internet bandwidth and extraction to the eu-west-1 Region did not prove prohibitive for this project.

  • The cost comparison showed that AWS Glue, while simpler to use than Amazon EMR, was in fact somewhat more expensive: extracting the data and writing it to S3 was 20% cheaper using Amazon EMR than using AWS Glue.

  • In the POC, existing SAS code was rewritten in Apache Spark and run on both the AWS Glue and Amazon EMR platforms; minor modifications allowed the customer to choose the platform best suited to their environment (a minimal sketch of such a rewrite follows this list).

  • The customer has many thousands of lines of SAS code in production. The project showed that this code base could be migrated to an open platform using AWS-native tools. Succeeding in production, however, would require retraining staff whose competence is currently in SAS.

  • Finally, migrating to a data lake would relieve pressure on the customer's storage requirements, which would in turn lower costs, as older data in the data lake could be archived automatically to Amazon S3 Glacier using lifecycle policies (a minimal rule is sketched below).
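
The following is a minimal sketch of the kind of rewrite involved: a SAS-style group-by summary (as one might express with PROC SQL or PROC MEANS) written in PySpark. Column names, paths and the aggregation itself are illustrative assumptions, not the customer's actual logic; the script runs as-is on Amazon EMR and needs only the Glue job wrapper shown earlier to run on AWS Glue.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sas-to-spark-sketch").getOrCreate()

# Read the raw extract produced by the extraction job (placeholder path).
sales = spark.read.parquet("s3://example-datalake/raw/sales_docs/")

# Roughly equivalent to a SAS PROC SQL summary:
#   SELECT store_id, month, SUM(amount), COUNT(DISTINCT doc_id)
#   FROM sales GROUP BY store_id, month;
monthly = (
    sales
    .withColumn("month", F.date_trunc("month", F.col("doc_date")))
    .groupBy("store_id", "month")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("doc_id").alias("document_count"),
    )
)

# Write the curated result back to the data lake, partitioned by month.
monthly.write.mode("overwrite").partitionBy("month").parquet(
    "s3://example-datalake/curated/monthly_sales/"
)
```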
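
As a sketch of the archiving mechanism mentioned in the last point, a lifecycle rule can be attached to the data lake bucket so that older objects transition to Amazon S3 Glacier automatically. The bucket name, prefix and 365-day threshold below are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the raw/ prefix to Glacier after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-to-glacier",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```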