Setting up a DataScience Server

[This article was first published on R – Networkx, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After installing multiple software, servers etc. on my  laptop it was overloaded with different tools and running services. When I get a new laptop or it will crash I can start over again installing everything, at home on my iMac I had the same tools and servers. So I decided to setup a DataScience Server with a necessary software and servers.

I know there are a lot easier and faster projects for setting up a datascience server, but I will only install one server with all the necessary software that I can connect to in my home network. Besides that it’s a lot of fun for doing this ?

In this post I will install a minimal CentOS 7 server with the containing software and servers to start with datascience:

  • Anaconda Python
  • The Jupyter Notebook
  • R and Rstudio Server
  • MongoDB
  • Splunk®

Minimal install CentOS 7 for setting up a datascience server

I will not explain it in detail, if you are not familiar with a CentOS installation, there are a lot of manuals to find.
Get a fresh “Minimal ISO” copy of the CentOS 7 image from https://www.centos.org/download/.
Burn it with your favorite software or mount it in your new virtual machine and boot it. I have changed some things like root password, timezone, disk layout etc.

If you have finished the minimal installation we need to install some needed packages.

# yum -y install net-tools ntp wget bzip2

Configure NTPD

You can edit your configuration and servers with vi /etc/ntp.conf, the default is good enough for me.

# systemctl start ntpd
# systemctl enable ntpd
# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+dns02.wsrs.net  193.79.237.14    2 u   15   64    1    9.496   -8.290   0.259
*84-245-30-184.d 193.79.237.14    2 u   14   64    1   12.822    3.127   0.173
 ntp4.bit.nl     .PPS.            1 u   13   64    1   11.873    2.484   0.303
 ran.as65342.net 192.36.144.23    2 u   12   64    1   11.692    2.915   0.174

Installing Anaconda Python

Anaconda is the leading open datascience platform powered by Python. The open source version of Anaconda is a high performance distribution of Python and R and includes over 100 of the most popular Python, R and Scala packages for datascience. (source: https://www.continuum.io)

Get the latest Linux version from https://www.continuum.io/downloads
This package has a total size of 392M

Follow the instructions, I changed the install location:

# wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
# bash Anaconda2-4.0.0-Linux-x86_64.sh
....
Anaconda2 will now be installed into this location:
/root/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda2] >>> /usr/local/bin/anaconda2

And I have updated my .bashrc

Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /root/.bashrc ? [yes|no]

[no] >>> yes

After the installation is completed check your path and reinitialise it.

# cat .bashrc
....
# added by Anaconda2 4.0.0 installer 
export PATH="/usr/local/bin/anaconda2/bin:$PATH"
# echo $PATH 
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin 
# which python 
/usr/bin/python

This is the default installed python with CentOS, we need the anaconda python to be default.

# cd ~
# . .bashrc 
# echo $PATH 
/usr/local/bin/anaconda2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin 
# which python 
/usr/local/bin/anaconda2/bin/python 
# python 
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec  6 2015, 18:08:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
Anaconda is brought to you by Continuum Analytics. 
Please check out: http://continuum.io/thanks and https://anaconda.org 
>>> quit() 
#

Installing The Jupyter Notebook

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more for doing datascience. (source: http://jupyter.org/)

# conda install jupyter

The notebook will default run on localhost and you need to start it by hand. I’ve created a systemd unit file to start it automaticly and runs on a different user.

First create the user:

# useradd jupyternb
# su - jupyternb
# mkdir notebooks
# exit

Create the system unit file for automatically start on boot.

# vi /usr/lib/systemd/system jupyter-notebook.service

In vi press i to enable insert and copy paste below, at the end press :wq

[Unit]
Description=The Jupyter HTML Notebook

[Service]
Type=simple
PIDFile=/var/run/jupyter-notebook.pid
ExecStart=/usr/local/bin/anaconda2/bin/jupyter notebook --no-browser --ip=*
User=jupyternb
Group=jupyternb
WorkingDirectory=/home/jupyternb/notebooks

[Install]
WantedBy=multi-user.target

Now we created the unit file we only need to reload the inits and enable the system unit file fo the Jupyter Notebook.

# systemctl daemon-reload
# systemctl enable jupyter-notebook
Created symlink from /etc/systemd/system/multi-user.target.wants/jupyter-notebook.service to /usr/lib/systemd/system/jupyter-notebook.service.
# systemctl start jupyter-notebook
# systemctl status jupyter-notebook
● jupyter-notebook.service - The Jupyter HTML Notebook
Loaded: loaded (/usr/lib/systemd/system/jupyter-notebook.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2016-05-16 16:23:41 CEST; 2s ago
Main PID: 2347 (jupyter-noteboo)
CGroup: /system.slice/jupyter-notebook.service
└─2347 /usr/local/bin/anaconda2/bin/python /usr/local/bin/anaconda2/bin/jupyter-notebook --no-browser --ip=*
May 16 16:23:41 dss.home systemd[1]: Starting The Jupyter HTML Notebook...
May 16 16:23:41 dss.home systemd[1]: Started The Jupyter HTML Notebook.
May 16 16:23:42 dss.home jupyter[2347]: [W 16:23:42.101 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
May 16 16:23:42 dss.home jupyter[2347]: [W 16:23:42.101 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure... recommended.
May 16 16:23:42 dss.home jupyter[2347]: [I 16:23:42.108 NotebookApp] Serving notebooks from local directory: /home/jupyternb/notebooks
May 16 16:23:42 dss.home jupyter[2347]: [I 16:23:42.109 NotebookApp] 0 active kernels
May 16 16:23:42 dss.home jupyter[2347]: [I 16:23:42.109 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/
May 16 16:23:42 dss.home jupyter[2347]: [I 16:23:42.109 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Hint: Some lines were ellipsized, use -l to show in full.

If the daemon started you can connect to http://your-server-name-here:8888 and shows your home screen.

datascience - jupyter

For detail configuration, like encryption and authentication you can check the official Jupiter documentation here http://jupyter-notebook.readthedocs.io/en/latest/

Installing R and Rstudio

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. (source: https://www.r-project.org/)

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. (source: https://www.rstudio.com/)

For installing R we need to install the Extra Packages for Enterprise Linux (EPEL) repo.

# yum -y install epel-release

Refresh the repo

# yum repolist

Now we can install R

# yum -y install R

This will install about 390 packages, so get a cup of coffee ?

Install  1 Package (+389 Dependent packages) 
Total download size: 337 M 
Installed size: 819 M

If R is installed we can install Rstudio-server
I have used the instructions from https://www.rstudio.com/products/rstudio/download-server-2/ This package has a total size of 280M

# yum install --nogpgcheck https://download2.rstudio.org/rstudio-server-rhel-0.99.896-x86_64.rpm

If everything went fine you can connect to your server with the following URL,  and you will see a sign in screen.

http://your-server-name-here:8787

datascience - rstudio

See the official Getting Started document for information configuring and managing the server.

Installing MongoDB

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. (source: https://www.mongodb.com/)

I have MongoDB installed as described on the MongoDB site you can find it here. Why should I write it again :). I installed CentOS 7, and therefore I used the Red Hat installation guide. To determine which platform you run, check it with the following command on CentOS.

# getconf LONG_BIT
64
#

As you can see we are running on a 64-bit platform, thats fine because the installation guide only supports 64-bit systems ?

Disable SELinux by setting the SELINUX setting to disabled in /etc/selinux/config.

SELINUX=disabled

You can check if MongoDB is running and listen on the tcp port.

# ps aux | grep -i mongo
mongod    1138  0.5  8.3 1502560 243764 ?      Sl   Jul04  88:26 /usr/bin/mongod -f /etc/mongod.conf
# netstat -atnp | grep -i mongo
tcp        0      0 0.0.0.0:27017           0.0.0.0:*               LISTEN      1138/mongod

Installing Splunk®

You see servers and devices, apps and logs, traffic and clouds. We see data—everywhere. Splunk®offers the leading platform for Operational Intelligence. It enables the curious to look closely at what others ignore—machine data—and find what others never see: insights that can help make your company more productive, profitable, competitive and secure. What can you do with Splunk?
Just ask.

stock_notes-1For downloading Splunk® you need to create an account on www.splunk.com.

You can get your free Splunk® Enterprise here: https://www.splunk.com/en_us/download/splunk-enterprise.html

We need to choose Linux, than the 64-bits, stop the download because we are going it to download with the wget command. On the right side we can find “Got wget?”, press that and copy the URL into your linux console to download the rpm package.

datascience-splunk-download

If Splunk® is downloaded than install it with rpm

# rpm -Uvh splunk-6.4.1-debde650d26e-linux-2.6-x86_64.rpm
warning: splunk-6.4.1-debde650d26e-linux-2.6-x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 653fb112: NOKEY
Preparing...                          ################################# [100%]
useradd: cannot create directory /opt/splunk
Updating / installing...
   1:splunk-6.4.1-debde650d26e        ################################# [100%]
complete
#

After install go the the directory where Splunk® is installed and start it. We accept the license directly with the start.

# cd /opt/splunk/bin
# ./splunk start --accept-license

This appears to be your first time running this version of Splunk.
Copying '/opt/splunk/etc/openldap/ldap.conf.default' to '/opt/splunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
.......++++++
..++++++
e is 65537 (0x10001)
writing RSA key

Generating RSA private key, 1024 bit long modulus
........++++++
.........++++++
e is 65537 (0x10001)
writing RSA key

Moving '/opt/splunk/share/splunk/search_mrsparkle/modules.new' to '/opt/splunk/share/splunk/search_mrsparkle/modules'.

Splunk> Winning the War on Error

Checking prerequisites...
	Checking http port [8000]: open
	Checking mgmt port [8089]: open
	Checking appserver port [127.0.0.1:8065]: open
	Checking kvstore port [8191]: open
	Checking configuration...  Done.
		Creating: /opt/splunk/var/lib/splunk
		Creating: /opt/splunk/var/run/splunk
		Creating: /opt/splunk/var/run/splunk/appserver/i18n
		Creating: /opt/splunk/var/run/splunk/appserver/modules/static/css
		Creating: /opt/splunk/var/run/splunk/upload
		Creating: /opt/splunk/var/spool/splunk
		Creating: /opt/splunk/var/spool/dirmoncache
		Creating: /opt/splunk/var/lib/splunk/authDb
		Creating: /opt/splunk/var/lib/splunk/hashDb
	Checking critical directories...	Done
	Checking indexes...
		Validated: _audit _internal _introspection _thefishbucket history main summary
	Done
New certs have been generated in '/opt/splunk/etc/auth'.
	Checking filesystem compatibility...  Done
	Checking conf files for problems...
	Done
	Checking default conf files for edits...
	Validating installed files against hashes from '/opt/splunk/splunk-6.4.1-debde650d26e-linux-2.6-x86_64-manifest'
	All installed files intact.
	Done
All preliminary checks passed.

Starting splunk server daemon (splunkd)...  
Generating a 1024 bit RSA private key
........++++++
....++++++
writing new private key to 'privKeySecure.pem'
-----
Signature ok
subject=/CN=dss.home/O=SplunkUser
Getting CA Private Key
writing RSA key
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available.... Done


If you get stuck, we're here to help.  
Look for answers here: http://docs.splunk.com

The Splunk web interface is at http://your-server-name-here:8000

You can now connect to http://your-server-name-here:8000

datascience-splunk

Configure Splunk® to start automatically

# cd /opt/splunk/bin
# ./splunk enable boot-start

Well thats it, you we can now start to gather some data and doing some datascience.

If you have some questions, follow me on Twitter or mail me, in the footer you can find my contact information.

Good luck!

The post Setting up a DataScience Server appeared first on Networkx.

To leave a comment for the author, please follow the link and comment on their blog: R – Networkx.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)