reticulate, virtualenv, and Python in Linux
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.
reticulate
is an R package that allows us to use Python modules from within RStudio. I recently found this functionality useful while trying to compare the results of different uplift models. Though I did have R’s uplift
package producing Qini charts and metrics, I also wanted to see how things looked with Wayfair’s promising pylift
package. Since pylift
is only available in python, reticulate
made it easy for me to quickly use pylift
from within RStudio.
In the article below, I’ll show how I worked through the following circumstances:
- Since
pylift
has only been tested on Python >= 3.6, and my system version of Python was 2.7, I needed to build and install Python 3.6 for myself, preferably within a self-contained virtual environment. - I wanted to install
pylift
in the virtual environment and set upreticulate
in my R Project to work within that environment. - Finally, I needed to access
pylift
from an R Markdown document via thereticulate
interface.
Setting up Python, virtualenv, and RStudio
Note: for consistency, I always use an instance created via r-studio-instance and a base project from r-studio-project.
Python 2.7 is the default on the systems I use (CentOS 6/7). Since I did not want to modify the system-level Python version, I installed Python 3.6.x at the user level in $HOME/opt
and created a virtual environment using Python 3. I then activated the Python 3 environment and installed pylift
. Finally, I ensured RStudio-Server 1.2 was installed, as it has advanced reticulate
support like plotting graphs in line in R Markdown documents.
Below is a brief script that accomplishes the tasks in bash on CentOS 7:
cd ~ mkdir tmp cd tmp wget https://www.python.org/ftp/python/3.6.2/Python-3.6.2.tgz tar -xzvf Python-3.6.2.tgz cd Python-3.6.2 ./configure --prefix=$HOME/opt/python-3.6.2 --enable-shared make make install cd ~ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/opt/python-3.6.2/lib virtualenv -p $HOME/opt/python-3.6.2/bin/python3 pylift source pylift/bin/activate cd pylift git clone https://github.com/wayfair/pylift cd pylift pip install . pip install -r requirements.txt cd wget https://s3.amazonaws.com/rstudio-ide-build/server/centos6/x86_64/rstudio-server-rhel-1.2.1335-x86_64.rpm sudo yum install -y --nogpgcheck rstudio-server-rhel-1.2.1335-x86_64.rpm sudo rstudio-server start
Some notes:
- the
--enable-shared
option is required when building Python in order forreticulate
to work - the
LD_LIBRARY_PATH
library also needs to be set prior to creating the virtual environment - we use virtualenv to create a virtual environment called “pylift” and then ensure that all Python packages are installed to that environment only (so as not to pollute any other environments we are working with)
- we then clone the
pylift
source and installpylift
along with all of its requirements viapip install -r requirements.txt
- finally, we install the RStudio Server 1.2 Preview version in order to leverage its advanced
reticulate
features
Using Python from within RStudio via reticulate
Switching from bash to RStudio, we load reticulate
and set it up to use the virtual environment we just created. Finally, and specific to pylift
, we set matplotlib parameters so that we can plot directly in R.
library(reticulate) Sys.setenv(LD_LIBRARY_PATH = paste0(Sys.getenv("HOME"),"/opt/python-3.6.2/lib")) Sys.getenv("LD_LIBRARY_PATH") use_virtualenv("/home/rstevenson/pylift", required=TRUE) py_config() # Currently this must be run in order for R-markdown plotting to work matplotlib <- import("matplotlib") matplotlib$use("Agg", force = TRUE)
Test that it works
The following replicates the first part of pylift tutorial: simulated data
import matplotlib.pyplot as plt import numpy as np t = np.arange(0.0, 2.0, 0.01) s = 1 + np.sin(2*np.pi*t) plt.plot(t,s)
When run, the above code chunk should display a sinusoidal graph below it.
import numpy as np, matplotlib as mpl, matplotlib.pyplot as plt, pandas as pd from pylift import TransformedOutcome from pylift.generate_data import dgp # Generate some data. df = dgp(N=10000, discrete_outcome=True) # Specify your dataframe, treatment column, and outcome column. up = TransformedOutcome(df, col_treatment='Treatment', col_outcome='Outcome', stratify=df['Treatment']) # This function randomly shuffles your training data set and calculates net information value. up.NIV()
The above Python chunk uses reticulate
from within RStudio to interact with pylift
in the context of a custom virtual environment, using a custom version of Python. This degree of customization and functionality should be useful to users who:
- want to use a different Python version than they typically use while not affecting their typical setup by way of a virtual environment
- want to install a Python module like
pylift
within a virtual environment so as not to affect any of their user- or system-level Python module installations - want to use
reticulate
from RStudio to access a custom virtual environment, Python version, and Python modules - wants to be able to delete the virtual environment and R-Project and have everything go back to the way it was
- wants to be able to reproduce or share the environment exactly so that the workflow can be shared with others
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.