Working with Relational Databases in R

On a recent project, we needed to connect to a MS SQL Server instance from R. Project members were mostly running some kind of Unix-like operating system. Connecting to a Microsoft product from R running on a Unix-like operating system was new to us. Documentation on how to do this was hard to find. This post describes how we configured our environment.

Environment and Software Versions

This post will use two virtual machines to explain how we configured our development environment. The first VM runs the MS SQL Server database instance. The second VM runs a CentOS 7.0 minimal OS installation where R is configured to connect to MS SQL Server running on the first VM. The project also required that we connect to the database server from an OS X machine. We successfully did this by following a similar installation path and using Xcode to compile unixODBC and the FreeTDS driver. The OS X installation steps will not be described in this post but they are very similar to the installation process followed under CentOS.

VM and Version Info

Here’s are some details on the virtual machines that were configured for this post:

  • VM1 - Database Server (Windows 8)

    • SQL Server 2016 - Express Edition
    • SQL Server Management Studio
  • VM2 - Development Box (CentOS)

    • CentOS 7.0 - Minimal Install
    • GCC - Version 4.8.5
    • R - Version 3.3.1
    • unixODBC­ - Version 2.3.4
    • FreeTDS Driver - Version 1.00.15

VM1 - Configuring the Database Server

This section describes how MS SQL Server is configured on the first VM. In a production environment, chances are that you would be connecting to a more robust installation of SQL Server than what I describe here. The basic installation described here is for illustration purposes only. I don’t go into detail because I rarely use Windows anymore and I suspect that I’m missing important steps for a robust MS SQL Server installation. At a very high level, the MS SQL Server instance is configured with the following characteristics :

  • Create a Windows 8.1 Enterprise virtual machine on VirtualBox
  • Install SQL Server Express 2016 core in mixed authentication mode
  • The installation process creates sa user with database administration privileges
  • Open firewall port 1433 to allow inbound traffic
  • Enable SQL Server protocol TCP IP - 1
  • Configure SQL Server for remote access. Ensure all TCP ports mentioned on all interfaces are set to 1433 - 2
  • Install SQL Server 2016 Management Studio

Database and User Creation

To create a test database (called test_db) on our SQL Server VM, these commands are executed in SQL Server Management Studio. To run these SQL commands, the sa user is used to login to the database prior to executing the commands.

-- Create the test_db database. 
IF EXISTS (SELECT [name] FROM sys.databases WHERE [name] = 'test_db')
BEGIN
	DROP DATABASE test_db
END

CREATE DATABASE test_db
;

USE test_db
;

--Create a user called test_db_user
CREATE LOGIN test_db_user WITH PASSWORD = ‘password123!’
;

CREATE USER [test_db_user] FOR LOGIN [test_db_user]
;

EXEC sp_addrolemember N’db_owner’, N’test_db_user’
;

After running the commands above, connecting to the new test_db database using the test_db_user / password123! user name and password combination is now possible. A person table was created in the test_db database for this post as seen in the SQL Server Management Studio screen shot below.

A few records were added to the person table, which we can be seen from the following query results.

VM2 - Configuring the Development Box

Once the SQL Server instance is installed and the test_db database is created on the first VM, the second VM can be configured to run R and connect to this database.

Base Packages Install

The commands below install the wget utility, which is used to fetch the EPEL repository RPM. The R package is one of the add-on packages found in the EPEL repository. The EPEL repository provides software packages for distributions like RedHat Enterprise Linux, CentOS and others. The easiest way to install gcc, make and company is to install the `Development Tools” package group using yum.

# sudo install wget
# wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-8.noarch.rpm
# sudo rpm -ivh epel-release-7-8.noarch.rpm
# rm epel-release-7-8.noarch.rpm
# sudo yum group install "Development Tools"

unixODBC Installation

ODBC stands for Open Database Connectivity. It’s a popular database abstraction API where ODBC drivers are implemented to provide access to different database back ends. I first heard of it when I was supporting Windows systems but it’s on Unix-like system as well. The steps below describe how unixODBC is installed on the CentOS VM.

# mkdir ~/_odbc_work
# cd ~/_odbc_work
# wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz
# tar -zxvf unixODBC-2.3.4.tar.gz
# cd unixODBC-2.3.4
# ./configure
# make
# sudo make install

FreeTDS Driver Installation

A few companies offer ODBC drivers for Linux that enable connections to SQL Server. The FreeTDS works just fine and … it’s free! Note that it’s a little more difficult to dig up documentation on how to install and configure the driver. Hopefully this post will fill a gap. The FreeTDS driver is installed using the following commands :

# cd ~/_odbc_work
# wget  ftp://ftp.freetds.org/pub/freetds/stable/freetds-patched.tar.gz
# tar -zxvf freetds-patched.tar.gz
# cd freetds-1.00.15
# ./configure --with-unixodbc=/usr/local
# make
# sudo make install

Check FreeTDS Environment

Once the FreeTDS driver is installed, the configuration can be verified with the command below :

# /usr/local/bin/tsql -C

The results should show something like :

Compile-time settings (established with the "configure" script)
                            Version: freetds v1.00.15
             freetds.conf directory: /usr/local/etc
     MS db-lib source compatibility: no
        Sybase binary compatibility: no
                      Thread safety: yes
                      iconv library: yes
                        TDS version: auto
                              iODBC: no
                           unixodbc: yes  ←-- !!! We need to see this yes !!!
              SSPI "trusted" logins: no
                           Kerberos: no
                            OpenSSL: no
                             GnuTLS: no
                               MARS: no

Add odbc.ini Entry

A unixODBC configuration needs to be added to reflect that the FreeTDS driver is used to connect to the test_db database.

# sudo vi /usr/local/etc/odbc.ini

The contents for the new test_db connection should look like :

[test_db]
Description=FreeTDS
Driver=FreeTDS
Servername=
Server=192.168.2.33  ←- !!! Change to the IP for your context !!!
Address=
Port=1433
Database=test_db
TDS_Version=7.3
Language=us_english
TextSize=
PacketSize=
ClientCharset=
DumpFile=
DumpFileAppend=
DebugFlags=
Encryption=off

Add an odbcinst.ini Entry for the TDS Driver

The FreeTDS driver needs to be registered with unixODBC.

# sudo vi /usr/local/etc/odbcinst.ini

The contents for the FreeTDS driver entry should look like :

[FreeTDS]
Description = FreeTDS
Driver=/usr/local/lib/libtdsodbc.so
Setup=/usr/local/lib/libtdsodbc.so
UsageCount = 1

Check that unixODBC knows about the driver

Once the configurations are done, it’s easy to check if unixODBC knows about the FreeTDS driver with the following command :

# /usr/local/bin/odbcinst -q -d

Results of running the command should look like :

[FreeTDS] <----------!!! Need to see this !!!!

Test the Connection to SQL Server

To ensure that there are no firewall issues between the client machine (VM2) and the SQL Server machine (VM1), one can try to telnet to the SQL Server machine on port 1433.

# telnet 192.168.2.33 1433

If there is something listening on that port and there are no firewall issues, there should be a response that looks like the following after issuing the telnet command :

Escape character is '^'.

Try out the Connection With tsql

Once it has been determined that there are no firewall issues between the client and server machines, the following command opens up tsql command prompt where queries can be sent to the SQL Server :

# tsql -S 192.168.2.33 -U test_db_user -P password123! -D test_db

The result of running the above command should look like :

locale is "en_US.utf8"
local charset is "UTF-8"
using default charset "UTF-8"
Setting test_db as default database in login packet

Enter these commands in the shell :

1> select * from person
2> go

The results of these commands should look something like :

id	first_name	last_name
1	jo        	walsh     
3	that      	person    
(2 rows affected)

R Installation

Once the ODBC connection to the SQL Server is verified, R and the RODBC package can be installed. R is installed with the following command :

# sudo yum install R

RODBC Installation

Before RODBC is installed, there is a small problem to fix. Otherwise, the following error will show up as RODBC is compiled from source :

Error :
configure: error: cannot compute sizeof (long), 77 See `config.log' for more details. ERROR: configuration failed for package ‘RODBC’

To fix this error, these lines need to be appended to the ~/.bash_profile file :

export ODBC_INCLUDE=/usr/local/include
export OCBC_LIBS=/usr/local/lib
export LD_LIBRARY_PATH=/usr/lib64:/usr/lib/gcc/x86_64-redhat-linux/4.8.5:/usr/local/lib

And sourced for the new environment variables to take :

# source ~/.bash_profile

The RODBC package can be installed using these commands :

# R
> install.packages('RODBC')

Run RODBC Test Script

Once the RODBC package is installed, the R code below should query the person table and populate the results variable :

> library ('RODBC')
> channel<- odbcConnect("test_db", uid="test_db_user", pwd="password123!")
> result <- sqlQuery(channel, paste("SELECT * FROM person"))
> result
> close(channel)

The results variable above should look something like :

> result
   id  first_name  last_name
1  1 jo          walsh     
2  3 that        person  

Final Thoughts

If you have to use a MS SQL Server database for a project, it’s nice to be able to fetch the data directly and not use some intermediary step, like exporting the data to text files and then loading these in R. unixODBC, the FreeTDS driver and RODBC provided a nice way for us to access the Microsoft data store directly and stay on Unix-like systems without having to worry too much about where the data is coming from.

Spark and SparkR provide the ability connect to different data sources like JSON text files and relational databases. I will explore this a bit next post.

Related Links

comments powered by Disqus