Setting up a DataScience Server
After installing all kinds of software and servers on my laptop, it was overloaded with tools and running services. If I get a new laptop, or the current one crashes, I have to start over and install everything again, and at home on my iMac I had the same tools and servers. So I decided to set up a DataScience Server with the necessary software and servers.
I know there are much easier and faster projects for setting up a datascience server, but I want to install just one server with all the necessary software that I can connect to in my home network. Besides that, it’s a lot of fun to do.
In this post I will install a minimal CentOS 7 server with the following software and servers to get started with datascience:
- Anaconda Python
- The Jupyter Notebook
- R and Rstudio Server
Minimal install CentOS 7 for setting up a datascience server
I will not explain the installation in detail; if you are not familiar with installing CentOS, there are plenty of manuals to be found.
Get a fresh “Minimal ISO” copy of the CentOS 7 image from https://www.centos.org/download/.
Burn it with your favorite software, or mount it in your new virtual machine, and boot it. During the installation I changed a few things, such as the root password, timezone and disk layout.
Once the minimal installation is finished, we need to install a few required packages.
# yum -y install net-tools ntp wget bzip2
You can edit the configuration and the time servers with vi /etc/ntp.conf; the defaults are good enough for me.
# systemctl start ntpd
# systemctl enable ntpd
# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+dns02.wsrs.net  18.104.22.168    2 u   15   64    1    9.496   -8.290   0.259
*84-245-30-184.d 22.214.171.124   2 u   14   64    1   12.822    3.127   0.173
 ntp4.bit.nl     .PPS.            1 u   13   64    1   11.873    2.484   0.303
 ran.as65342.net 126.96.36.199    2 u   12   64    1   11.692    2.915   0.174
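If you would rather check the synchronisation from a script than read the table by eye, something like the sketch below could work. The function name has_sync_peer is made up for this example; it only relies on ntpq marking the currently selected peer with a leading asterisk, as in the output above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original post): succeeds when the
# `ntpq -p` output read from stdin contains a line starting with "*",
# which is how ntpq marks the peer the daemon is currently synchronised to.
has_sync_peer() {
    grep -q '^\*'
}

# Usage on the server (commented out so the snippet has no side effects):
#   ntpq -p | has_sync_peer && echo "NTP is synchronised"
```

Right after starting ntpd it can take a few minutes before a peer is selected, so a failing check directly after the start is not necessarily a problem.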
Installing Anaconda Python
Anaconda is the leading open datascience platform powered by Python. The open source version of Anaconda is a high performance distribution of Python and R and includes over 100 of the most popular Python, R and Scala packages for datascience. (source: https://www.continuum.io)
Get the latest Linux version from https://www.continuum.io/downloads
This package has a total size of 392M.
Follow the instructions of the installer. I changed the install location:
# wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
# bash Anaconda2-4.0.0-Linux-x86_64.sh
....
Anaconda2 will now be installed into this location:
/root/anaconda2
  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below
[/root/anaconda2] >>> /usr/local/bin/anaconda2
I also let the installer update my .bashrc:
Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> yes
After the installation has completed, check your path and reinitialise it.
# cat .bashrc
....
# added by Anaconda2 4.0.0 installer
export PATH="/usr/local/bin/anaconda2/bin:$PATH"
# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin
# which python
/usr/bin/python
This is the Python that CentOS installs by default; we want the Anaconda Python to be the default instead.
# cd ~
# . .bashrc
# echo $PATH
/usr/local/bin/anaconda2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin
# which python
/usr/local/bin/anaconda2/bin/python
# python
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec 6 2015, 18:08:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> quit()
#
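The same check can be scripted, for example at the top of a job that depends on the Anaconda Python. This is only a sketch: the function names are invented for this example, and the directory is the install location I chose above, so adjust it if you installed somewhere else.

```shell
#!/bin/bash
# Hypothetical helper: prints the first directory of a PATH-style string.
first_path_entry() {
    printf '%s\n' "${1%%:*}"
}

# Assumed install location, taken from the installer answer given above.
ANACONDA_BIN="/usr/local/bin/anaconda2/bin"

# Succeeds when Anaconda's bin directory is searched before everything else,
# which is what makes `python` resolve to the Anaconda interpreter.
anaconda_is_first() {
    [ "$(first_path_entry "$1")" = "$ANACONDA_BIN" ]
}

# Usage:
#   anaconda_is_first "$PATH" && echo "Anaconda python is the default"
```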
Installing The Jupyter Notebook
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more for doing datascience. (source: http://jupyter.org/)
# conda install jupyter
By default the notebook listens only on localhost and has to be started by hand. I’ve created a systemd unit file that starts it automatically and runs it as a separate user.
First create the user:
# useradd jupyternb
# su - jupyternb
# mkdir notebooks
# exit
Create the systemd unit file so the notebook starts automatically on boot.
# vi /usr/lib/systemd/system/jupyter-notebook.service
In vi, press i to enter insert mode and paste the text below; at the end, save the file and quit vi.
[Unit]
Description=The Jupyter HTML Notebook

[Service]
Type=simple
PIDFile=/var/run/jupyter-notebook.pid
ExecStart=/usr/local/bin/anaconda2/bin/jupyter notebook --no-browser --ip=*
User=jupyternb
Group=jupyternb
WorkingDirectory=/home/jupyternb/notebooks

[Install]
WantedBy=multi-user.target
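If you prefer not to use an interactive editor, the same file can be written from a script with a here-document. The sketch below is just one way to do it; the function name write_jupyter_unit is made up, and the content is the unit file from above, unchanged.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): writes the unit file
# to the path given as the first argument. The quoted 'EOF' delimiter keeps
# the shell from expanding anything inside the here-document.
write_jupyter_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=The Jupyter HTML Notebook

[Service]
Type=simple
PIDFile=/var/run/jupyter-notebook.pid
ExecStart=/usr/local/bin/anaconda2/bin/jupyter notebook --no-browser --ip=*
User=jupyternb
Group=jupyternb
WorkingDirectory=/home/jupyternb/notebooks

[Install]
WantedBy=multi-user.target
EOF
}

# Usage as root (commented out so the snippet has no side effects):
#   write_jupyter_unit /usr/lib/systemd/system/jupyter-notebook.service
```

Writing the file this way also makes the setup repeatable when the server has to be rebuilt.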
Now that the unit file has been created, we only need to reload the systemd configuration and enable the unit file for the Jupyter Notebook.
# systemctl daemon-reload
# systemctl enable jupyter-notebook
Created symlink from /etc/systemd/system/multi-user.target.wants/jupyter-notebook.service to /usr/lib/systemd/system/jupyter-notebook.service.
# systemctl start jupyter-notebook
# systemctl status jupyter-notebook
● jupyter-notebook.service - The Jupyter HTML Notebook
   Loaded: loaded (/usr/lib/systemd/system/jupyter-notebook.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-05-16 16:23:41 CEST; 2s ago
 Main PID: 2347 (jupyter-noteboo)
   CGroup: /system.slice/jupyter-notebook.service
           └─2347 /usr/local/bin/anaconda2/bin/python /usr/local/bin/anaconda2/bin/jupyter-notebook --no-browser --ip=*
May 16 16:23:41 dss.home systemd: Starting The Jupyter HTML Notebook...
May 16 16:23:41 dss.home systemd: Started The Jupyter HTML Notebook.
May 16 16:23:42 dss.home jupyter: [W 16:23:42.101 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
May 16 16:23:42 dss.home jupyter: [W 16:23:42.101 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure... recommended.
May 16 16:23:42 dss.home jupyter: [I 16:23:42.108 NotebookApp] Serving notebooks from local directory: /home/jupyternb/notebooks
May 16 16:23:42 dss.home jupyter: [I 16:23:42.109 NotebookApp] 0 active kernels
May 16 16:23:42 dss.home jupyter: [I 16:23:42.109 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/
May 16 16:23:42 dss.home jupyter: [I 16:23:42.109 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Hint: Some lines were ellipsized, use -l to show in full.
If the daemon has started, you can connect to http://your-server-name-here:8888, which shows your home screen.
For detailed configuration, such as encryption and authentication, check the official Jupyter documentation at http://jupyter-notebook.readthedocs.io/en/latest/
Installing R and Rstudio
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. (source: https://www.r-project.org/)
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. (source: https://www.rstudio.com/)
To install R we first need to add the Extra Packages for Enterprise Linux (EPEL) repo.
# yum -y install epel-release
Refresh the repo
# yum repolist
Now we can install R
# yum -y install R
This will install about 390 packages, so get a cup of coffee
Install  1 Package (+389 Dependent packages)
Total download size: 337 M
Installed size: 819 M
Once R is installed we can install RStudio Server.
I used the instructions from https://www.rstudio.com/products/rstudio/download-server-2/. This package has a total size of 280M.
# yum install --nogpgcheck https://download2.rstudio.org/rstudio-server-rhel-0.99.896-x86_64.rpm
If everything went fine you can connect to your server with the following URL, and you will see a sign in screen.
See the official Getting Started document for information on configuring and managing the server.
Installing MongoDB
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. (source: https://www.mongodb.com/)
I installed MongoDB as described in the installation guide on the MongoDB site, so there is no need to write it all out again :). Because I installed CentOS 7, I used the Red Hat installation guide. To determine which platform you are running, use the following command on CentOS.
# getconf LONG_BIT
64
#
As you can see we are running on a 64-bit platform. That's fine, because the installation guide only supports 64-bit systems.
Disable SELinux by setting the SELINUX setting to disabled in /etc/selinux/config.
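Instead of editing the file by hand, the setting can be changed with sed. This is a minimal sketch: the function name disable_selinux_in is invented for this example, and it assumes the file contains a SELINUX= line, as the stock CentOS 7 config does. The pattern only matches the SELINUX= line, so the SELINUXTYPE= line and the comments stay as they are.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): rewrites the SELINUX=
# line of the given config file to "disabled". Uses in-place editing (-i)
# as provided by GNU sed, which is what CentOS ships.
disable_selinux_in() {
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}

# Usage as root (commented out so the snippet has no side effects):
#   disable_selinux_in /etc/selinux/config
```

The change in the config file is only picked up at the next boot, so a reboot is needed before SELinux is really disabled.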
You can check whether MongoDB is running and listening on its TCP port.
# ps aux | grep -i mongo
mongod 1138 0.5 8.3 1502560 243764 ? Sl Jul04 88:26 /usr/bin/mongod -f /etc/mongod.conf
# netstat -atnp | grep -i mongo
tcp 0 0 0.0.0.0:27017 0.0.0.0:* LISTEN 1138/mongod
Installing Splunk
You see servers and devices, apps and logs, traffic and clouds. We see data—everywhere. Splunk®offers the leading platform for Operational Intelligence. It enables the curious to look closely at what others ignore—machine data—and find what others never see: insights that can help make your company more productive, profitable, competitive and secure. What can you do with Splunk?
For downloading Splunk® you need to create an account on www.splunk.com.
You can get your free Splunk® Enterprise here: https://www.splunk.com/en_us/download/splunk-enterprise.html
We need to choose Linux, then 64-bit, and then stop the download, because we are going to download it with the wget command. On the right-hand side you will find “Got wget?”; click it and copy the URL into your Linux console to download the rpm package.
Once Splunk® has been downloaded, install it with rpm.
# rpm -Uvh splunk-6.4.1-debde650d26e-linux-2.6-x86_64.rpm
warning: splunk-6.4.1-debde650d26e-linux-2.6-x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 653fb112: NOKEY
Preparing...                          ################################# [100%]
useradd: cannot create directory /opt/splunk
Updating / installing...
   1:splunk-6.4.1-debde650d26e        ################################# [100%]
complete
#
After the install, go to the directory where Splunk® is installed and start it. We accept the license directly on the start command.
# cd /opt/splunk/bin
# ./splunk start --accept-license
This appears to be your first time running this version of Splunk.
Copying '/opt/splunk/etc/openldap/ldap.conf.default' to '/opt/splunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
.......++++++
..++++++
e is 65537 (0x10001)
writing RSA key
Generating RSA private key, 1024 bit long modulus
........++++++
.........++++++
e is 65537 (0x10001)
writing RSA key
Moving '/opt/splunk/share/splunk/search_mrsparkle/modules.new' to '/opt/splunk/share/splunk/search_mrsparkle/modules'.
Splunk> Winning the War on Error
Checking prerequisites...
    Checking http port : open
    Checking mgmt port : open
    Checking appserver port [127.0.0.1:8065]: open
    Checking kvstore port : open
    Checking configuration... Done.
        Creating: /opt/splunk/var/lib/splunk
        Creating: /opt/splunk/var/run/splunk
        Creating: /opt/splunk/var/run/splunk/appserver/i18n
        Creating: /opt/splunk/var/run/splunk/appserver/modules/static/css
        Creating: /opt/splunk/var/run/splunk/upload
        Creating: /opt/splunk/var/spool/splunk
        Creating: /opt/splunk/var/spool/dirmoncache
        Creating: /opt/splunk/var/lib/splunk/authDb
        Creating: /opt/splunk/var/lib/splunk/hashDb
    Checking critical directories... Done
    Checking indexes...
        Validated: _audit _internal _introspection _thefishbucket history main summary
    Done
New certs have been generated in '/opt/splunk/etc/auth'.
    Checking filesystem compatibility... Done
    Checking conf files for problems...
    Done
    Checking default conf files for edits...
    Validating installed files against hashes from '/opt/splunk/splunk-6.4.1-debde650d26e-linux-2.6-x86_64-manifest'
    All installed files intact.
    Done
All preliminary checks passed.
Starting splunk server daemon (splunkd)...
Generating a 1024 bit RSA private key
........++++++
....++++++
writing new private key to 'privKeySecure.pem'
-----
Signature ok
subject=/CN=dss.home/O=SplunkUser
Getting CA Private Key
writing RSA key
Done
[ OK ]
Waiting for web server at http://127.0.0.1:8000 to be available.... Done
If you get stuck, we're here to help.
Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://your-server-name-here:8000
You can now connect to http://your-server-name-here:8000
Configure Splunk® to start automatically
# cd /opt/splunk/bin
# ./splunk enable boot-start
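With everything installed, a quick reachability check from another machine in the network can save some clicking around. The sketch below is an assumption on my side rather than part of the setup: the function names are invented, it uses bash's /dev/tcp redirection together with timeout from coreutils, and it only tests the ports that appear in the output earlier in this post (8888 for Jupyter, 27017 for MongoDB and 8000 for Splunk).

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): succeeds when a TCP
# connection to host:port can be opened within three seconds.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Prints one status line for each service port mentioned in this post.
check_services() {
    local host="$1" port
    for port in 8888 27017 8000; do   # Jupyter, MongoDB, Splunk web
        if port_open "$host" "$port"; then
            echo "port ${port}: reachable"
        else
            echo "port ${port}: not reachable"
        fi
    done
}

# Usage:
#   check_services your-server-name-here
```

A port that is reported as not reachable while the service is running on the server usually points to a firewall or to a service that only listens on localhost.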
Well, that's it: we can now start gathering some data and doing some datascience.
If you have any questions, follow me on Twitter or mail me; you can find my contact information in the footer.
The post Setting up a DataScience Server appeared first on Networkx.