Transfer Data from RDBMS to Hadoop Using Sqoop/Oozie/Hue

A lengthy title. I know.

Cutting to the chase, I needed to transfer data from my RDBMS to a Hadoop cluster and I only had access to Hue/Oozie (since I’m not an admin). I knew that I could use Sqoop to do it — but I’ve never really done it before.

It was freaking hard/annoying!

So to help others out there who might be in a similar predicament as I was, here are some 101.

1. I assume you already know how to use the Workflow Editor, so from there, create a new Workflow.

2. Drag a Sqoop action from the panel above and click OK.

3. You’ll get some pre-filled sqoop command in there which you can use as reference. Hop to Apache Sqoop to learn more about all available arguments you can use.

4. There’s 2 way you can go about entering the Sqoop command from here on out. You can either type in the Sqoop command in the text box, OR if you’re thinking of using a query in your command, my recommendation is to use the argument window. The latter is based on a post from SO, where some guys had issues when using the free-form query in Sqoop/Oozie.

5. Change the value of the property sqoop.connection.factories to your corresponding DB. In my case I’m taking data from a Netezza, so below is my setting.

You need to click this to change the settings below.

6. Include hive-site.xml into your Sqoop workflow. Refer to key points section in this article for more explanation as to why we need to include the file into our workflow.

7. You’re all set. You may now run your workflow.

8. If the above doesn’t work, try checking out the stdout logs and determine where it could’ve went wrong. In my case, I’m using the username HAFIDZ but the table (in Netezza) was created by DBA. To fix this, we need to include the schema along with the table name in the ‘table’ parameter. For example, DBA.TABLENAME

References (in no particular order):
[1] Sqoop User Guide :
[2] How to Oozie + Sqoop + Hive:
[3] Sqoop Free-form Query Causing Unrecognized Arguments in Hue/Oozie :


Popular posts from this blog

HIVE: Both Left and Right Aliases Encountered in Join

Assign select result to variable in Netezza stored procedure

Splitting value in Netezza using array_split