MinIO is a high-performance, distributed object storage system. It is software-defined, runs on industry-standard hardware, and is 100% open source under the Apache V2 license. Today, MinIO is deployed globally, with over 272.5M+ Docker pulls and 18K+ Git commits. MinIO is written in Go, so you can expect fast response times. You can read more about it here.
MinIO and Data Science
MinIO has played a pivotal role in data science deployments. Today, we deal with data in various formats such as images, videos, audio clips, and other proprietary objects. Storing this information in traditional databases is quite challenging, and they cannot deliver the response times that high-frequency applications demand.
Another application of MinIO in data science is storing trained models. Deep learning models usually have much larger file sizes than their machine learning counterparts, which are typically a few KB to a few MB.
MinIO officially supports integrations with Python, Go, and other languages. As a heavy R user, I found it quite challenging to use MinIO through R. The first solution that came to my mind was to use reticulate to access MinIO through Python. Again, this is fine for testing but not feasible for deployment into production.
MinIO can be installed in a few lines of code and is well documented here. MinIO can be deployed on Linux, macOS, Windows, and Kubernetes. For prototyping, I would recommend running a stateful Docker container. Instructions for running on Docker can be found here.
R Package minio.s3
MinIO is compatible with the Amazon S3 cloud service, so we can technically use S3-compatible APIs to access MinIO storage. You might be wondering: don't we already have a package for accessing Amazon Web Services (AWS) through R? You are right; R does have a package called aws.s3, developed by the cloudyR team, that we could use to access AWS. I tried using that package, but it was quite clunky for accessing MinIO, and not all of its functions were compatible.
So, my solution was to take their package and tweak it quite a bit so that it could be used for accessing MinIO. The end product is the minio.s3 package.
I would like to thank the cloudyR team for their initial contributions to this package.
This package is not yet on CRAN. To install the latest development version, you can install it from GitHub:
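A typical GitHub installation uses the remotes package. The repository path below is a placeholder; substitute the actual GitHub location of minio.s3:

```r
# install.packages("remotes")  # if remotes is not already installed

# "<user>/minio.s3" is a placeholder repository path; replace it with
# the package's actual GitHub location.
remotes::install_github("<user>/minio.s3")
```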
By default, all packages for AWS/MinIO services allow the use of credentials specified in a number of ways, beginning with:
- User-supplied values passed directly to functions.
- Environment variables, which can alternatively be set on the command line prior to starting R or via an .Renviron file, which is used to set environment variables in R during startup (see ?Startup). They can also be set within R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "test",           # enter your credentials
           "AWS_SECRET_ACCESS_KEY" = "test123",    # enter your credentials
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_S3_ENDPOINT" = "192.168.1.1:8085") # change it to your specific IP and port
For more information on AWS usage, refer to the aws.s3 package.
The package can be used to examine publicly accessible S3 buckets and publicly accessible S3 objects.
library("minio.s3")
bucketlist(add_region = FALSE)
If your credentials are incorrect, this function will return an error. Otherwise, it will return a list of information about the buckets you have access to.
Create a bucket
To create a new bucket, simply call
put_bucket('my-bucket', acl = "public-read-write", use_https=F)
If successful, it should return TRUE.
List Bucket Contents
To get a listing of all objects in a public bucket, simply call
get_bucket(bucket = 'my-bucket', use_https = F)
Delete a Bucket
To delete a bucket, simply call
delete_bucket(bucket = 'my-bucket', use_https = F)
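Note that a bucket generally must be empty before it can be deleted. Since minio.s3 mirrors the aws.s3 API, individual objects can presumably be removed first with delete_object() (a sketch, assuming the function behaves as in aws.s3):

```r
# remove an object from the bucket before attempting to delete the bucket
delete_object(object = "mtcars.Rdata", bucket = "my-bucket", use_https = F)

# now the (empty) bucket can be deleted
delete_bucket(bucket = "my-bucket", use_https = F)
```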
There are ten main functions that will be useful for working with objects in S3:
- s3read_using() provides a generic interface for reading from S3 objects using a user-defined function
- s3write_using() provides a generic interface for writing to S3 objects using a user-defined function
- get_object() returns a raw vector representation of an S3 object. This might then be parsed in a number of ways, such as with jsonlite::fromJSON(), depending on the file format of the object
- save_object() saves an S3 object to a specified local file
- put_object() stores a local file into an S3 bucket
- s3save() saves one or more in-memory R objects to an .Rdata file in S3 (analogously to save())
- s3saveRDS() is an analogue for saveRDS()
- s3load() loads one or more objects into memory from an .Rdata file stored in S3 (analogously to load())
- s3readRDS() is an analogue for readRDS()
- s3source() sources an R script directly from S3
They behave as you would probably expect:
# save an in-memory R object into S3
s3save(mtcars, bucket = "my_bucket", object = "mtcars.Rdata", use_https = F)

# `load()` R objects from the file
s3load("mtcars.Rdata", bucket = "my_bucket", use_https = F)

# get file as raw vector
get_object("mtcars.Rdata", bucket = "my_bucket", use_https = F)

# alternative 'S3 URI' syntax:
get_object("s3://my_bucket/mtcars.Rdata", use_https = F)

# save file locally
save_object("mtcars.Rdata", file = "mtcars.Rdata", bucket = "my_bucket", use_https = F)

# put local file into S3
put_object(file = "mtcars.Rdata", object = "mtcars2.Rdata", bucket = "my_bucket", use_https = F)
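The generic s3read_using() and s3write_using() functions are not shown above. A minimal sketch, assuming their signatures follow the aws.s3 originals, where FUN is any reader or writer whose first argument is the object or file:

```r
library("minio.s3")

# write a data frame to the bucket as CSV, using a user-supplied writer
s3write_using(mtcars, FUN = write.csv,
              object = "mtcars.csv", bucket = "my_bucket", use_https = F)

# read it back with a matching reader function
df <- s3read_using(FUN = read.csv,
                   object = "mtcars.csv", bucket = "my_bucket", use_https = F)
```

This pattern lets you store any format R can serialize (CSV, feather, parquet, and so on) without a dedicated helper for each one.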
Please feel free to email me if you have any questions or comments. If you run into any issues with the package, please create an issue on GitHub. Also, check out my GitHub page for other R packages, tutorials, and other projects.