Building a data pipeline- uploading external data in AWS S3

[This article was first published on Stories Data Speak, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Recently, I stepped into the AWS ecosystem to learn and explore its capabilities. I’m documenting my experiences in these series of posts. Hopefully, they will serve as a reference point to me in future or for anyone else following this path. The objective of this post is, to understand how to create a data pipeline. Read on to see how I did it. Certainly, there can be much more efficient ways, and I hope to find them too. If you know such better method’s, please suggest them in the comments section.

How to upload external data in Amazon AWS S3

Step 1: In the AWS S3 user management console, click on your bucket name.


Step 2: Use the upload tab to upload external data into your bucket.


Step 3: Once the data is uploaded, click on it. In the Overview tab, at the bottom of the page you’ll see, Object Url. Copy this url and paste it in notepad.


Step 4:

Now click on the Permissions tab.

Under the section, Public access, click on the radio button Everyone. It will open up a window.

Put a checkmark on Read object permissions in Access to this objects ACL. This will give access to reading the data from the given object url.

Note: Do not give write object permission access. Also, if read access is not given then the data cannot be read by Sagemaker


AWS Sagemaker for consuming S3 data

Step 5

  • Open AWS Sagemaker.

  • From the Sagemaker dashboard, click on the button create a notebook instance. I have already created one as shown below.


  • click on Open Jupyter tab

Step 6

  • In Sagemaker Jupyter notebook interface, click on the New tab (see screenshot) and choose the programming environment of your choice.


Step 7

  • Read the data in the programming environment. I have chosen R in step 6.


Accessing data in S3 bucket with python

There are two methods to access the data file;

  1. The Client method
  2. The Object URL method

See this IPython notebook for details.

AWS Data pipeline

To build an AWS Data pipeline, following steps need to be followed;

  • Ensure the user has the required IAM Roles. See this AWS documentation
  • To use AWS Data Pipeline, you create a pipeline definition that specifies the business logic for your data processing. A typical pipeline definition consists of activities that define the work to perform, data nodes that define the location and type of input and output data, and a schedule that determines when the activities are performed.

Note: To be continued

To leave a comment for the author, please follow the link and comment on their blog: Stories Data Speak. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)