On a recent project, we needed to connect to a MS SQL Server instance from R. Project members were mostly running some kind of Unix-like operating system. Connecting to a Microsoft product from R running on a Unix-like operating system was new to us. Documentation on how to do this was hard to find. This post describes how we configured our environment.
Environment and Software Versions
This post will use two virtual machines to explain how we configured our development environment. The first VM runs the MS SQL Server database instance. The second VM runs a CentOS 7.0 minimal OS installation where R is configured to connect to MS SQL Server running on the first VM. The project also required that we connect to the database server from an OS X machine. We successfully did this by following a similar installation path and using Xcode to compile unixODBC and the FreeTDS driver. The OS X installation steps will not be described in this post but they are very similar to the installation process followed under CentOS.
VM and Version Info
Here’s are some details on the virtual machines that were configured for this post:
VM1 - Database Server (Windows 8)
- SQL Server 2016 - Express Edition
- SQL Server Management Studio
VM2 - Development Box (CentOS)
- CentOS 7.0 - Minimal Install
- GCC - Version 4.8.5
- R - Version 3.3.1
- unixODBC - Version 2.3.4
- FreeTDS Driver - Version 1.00.15
VM1 - Configuring the Database Server
This section describes how MS SQL Server is configured on the first VM. In a production environment, chances are that you would be connecting to a more robust installation of SQL Server than what I describe here. The basic installation described here is for illustration purposes only. I don’t go into detail because I rarely use Windows anymore and I suspect that I’m missing important steps for a robust MS SQL Server installation. At a very high level, the MS SQL Server instance is configured with the following characteristics :
- Create a Windows 8.1 Enterprise virtual machine on VirtualBox
- Install SQL Server Express 2016 core in
mixed authentication mode
- The installation process creates
sauser with database administration privileges
- Open firewall port 1433 to allow inbound traffic
- Enable SQL Server protocol TCP IP - 1
- Configure SQL Server for remote access. Ensure all TCP ports mentioned on all interfaces are set to 1433 - 2
- Install SQL Server 2016 Management Studio
Database and User Creation
To create a test database (called test_db) on our SQL Server VM, these commands are executed in SQL Server Management Studio. To run these SQL commands, the
sa user is used to login to the database prior to executing the commands.
-- Create the test_db database. IF EXISTS (SELECT [name] FROM sys.databases WHERE [name] = 'test_db') BEGIN DROP DATABASE test_db END CREATE DATABASE test_db ; USE test_db ; --Create a user called test_db_user CREATE LOGIN test_db_user WITH PASSWORD = ‘password123!’ ; CREATE USER [test_db_user] FOR LOGIN [test_db_user] ; EXEC sp_addrolemember N’db_owner’, N’test_db_user’ ;
After running the commands above, connecting to the new
test_db database using the
password123! user name and password combination is now possible. A
person table was created in the
test_db database for this post as seen in the SQL Server Management Studio screen shot below.
A few records were added to the
person table, which we can be seen from the following query results.
VM2 - Configuring the Development Box
Once the SQL Server instance is installed and the test_db database is created on the first VM, the second VM can be configured to run R and connect to this database.
Base Packages Install
The commands below install the wget utility, which is used to fetch the EPEL repository RPM. The R package is one of the add-on packages found in the EPEL repository. The EPEL repository provides software packages for distributions like RedHat Enterprise Linux, CentOS and others. The easiest way to install gcc, make and company is to install the `Development Tools” package group using yum.
# sudo install wget # wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-8.noarch.rpm # sudo rpm -ivh epel-release-7-8.noarch.rpm # rm epel-release-7-8.noarch.rpm # sudo yum group install "Development Tools"
ODBC stands for Open Database Connectivity. It’s a popular database abstraction API where ODBC drivers are implemented to provide access to different database back ends. I first heard of it when I was supporting Windows systems but it’s on Unix-like system as well. The steps below describe how unixODBC is installed on the CentOS VM.
# mkdir ~/_odbc_work # cd ~/_odbc_work # wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz # tar -zxvf unixODBC-2.3.4.tar.gz # cd unixODBC-2.3.4 # ./configure # make # sudo make install
FreeTDS Driver Installation
A few companies offer ODBC drivers for Linux that enable connections to SQL Server. The FreeTDS works just fine and … it’s free! Note that it’s a little more difficult to dig up documentation on how to install and configure the driver. Hopefully this post will fill a gap. The FreeTDS driver is installed using the following commands :
# cd ~/_odbc_work # wget ftp://ftp.freetds.org/pub/freetds/stable/freetds-patched.tar.gz # tar -zxvf freetds-patched.tar.gz # cd freetds-1.00.15 # ./configure --with-unixodbc=/usr/local # make # sudo make install
Check FreeTDS Environment
Once the FreeTDS driver is installed, the configuration can be verified with the command below :
# /usr/local/bin/tsql -C
The results should show something like :
Compile-time settings (established with the "configure" script) Version: freetds v1.00.15 freetds.conf directory: /usr/local/etc MS db-lib source compatibility: no Sybase binary compatibility: no Thread safety: yes iconv library: yes TDS version: auto iODBC: no unixodbc: yes ←-- !!! We need to see this yes !!! SSPI "trusted" logins: no Kerberos: no OpenSSL: no GnuTLS: no MARS: no
Add odbc.ini Entry
A unixODBC configuration needs to be added to reflect that the FreeTDS driver is used to connect to the
# sudo vi /usr/local/etc/odbc.ini
The contents for the new
test_db connection should look like :
[test_db] Description=FreeTDS Driver=FreeTDS Servername= Server=192.168.2.33 ←- !!! Change to the IP for your context !!! Address= Port=1433 Database=test_db TDS_Version=7.3 Language=us_english TextSize= PacketSize= ClientCharset= DumpFile= DumpFileAppend= DebugFlags= Encryption=off
Add an odbcinst.ini Entry for the TDS Driver
The FreeTDS driver needs to be registered with unixODBC.
# sudo vi /usr/local/etc/odbcinst.ini
The contents for the FreeTDS driver entry should look like :
[FreeTDS] Description = FreeTDS Driver=/usr/local/lib/libtdsodbc.so Setup=/usr/local/lib/libtdsodbc.so UsageCount = 1
Check that unixODBC knows about the driver
Once the configurations are done, it’s easy to check if unixODBC knows about the FreeTDS driver with the following command :
# /usr/local/bin/odbcinst -q -d
Results of running the command should look like :
[FreeTDS] <----------!!! Need to see this !!!!
Test the Connection to SQL Server
To ensure that there are no firewall issues between the client machine (VM2) and the SQL Server machine (VM1), one can try to telnet to the SQL Server machine on port 1433.
# telnet 192.168.2.33 1433
If there is something listening on that port and there are no firewall issues, there should be a response that looks like the following after issuing the telnet command :
Escape character is '^'.
Try out the Connection With tsql
Once it has been determined that there are no firewall issues between the client and server machines, the following command opens up tsql command prompt where queries can be sent to the SQL Server :
# tsql -S 192.168.2.33 -U test_db_user -P password123! -D test_db
The result of running the above command should look like :
locale is "en_US.utf8" local charset is "UTF-8" using default charset "UTF-8" Setting test_db as default database in login packet
Enter these commands in the shell :
1> select * from person 2> go
The results of these commands should look something like :
id first_name last_name 1 jo walsh 3 that person (2 rows affected)
Once the ODBC connection to the SQL Server is verified, R and the RODBC package can be installed. R is installed with the following command :
# sudo yum install R
Before RODBC is installed, there is a small problem to fix. Otherwise, the following error will show up as RODBC is compiled from source :
Error : configure: error: cannot compute sizeof (long), 77 See `config.log' for more details. ERROR: configuration failed for package ‘RODBC’
To fix this error, these lines need to be appended to the ~/.bash_profile file :
export ODBC_INCLUDE=/usr/local/include export OCBC_LIBS=/usr/local/lib export LD_LIBRARY_PATH=/usr/lib64:/usr/lib/gcc/x86_64-redhat-linux/4.8.5:/usr/local/lib
And sourced for the new environment variables to take :
# source ~/.bash_profile
The RODBC package can be installed using these commands :
# R > install.packages('RODBC')
Run RODBC Test Script
Once the RODBC package is installed, the R code below should query the person table and populate the results variable :
> library ('RODBC') > channel<- odbcConnect("test_db", uid="test_db_user", pwd="password123!") > result <- sqlQuery(channel, paste("SELECT * FROM person")) > result > close(channel)
results variable above should look something like :
> result id first_name last_name 1 1 jo walsh 2 3 that person
If you have to use a MS SQL Server database for a project, it’s nice to be able to fetch the data directly and not use some intermediary step, like exporting the data to text files and then loading these in R. unixODBC, the FreeTDS driver and RODBC provided a nice way for us to access the Microsoft data store directly and stay on Unix-like systems without having to worry too much about where the data is coming from.
Spark and SparkR provide the ability connect to different data sources like JSON text files and relational databases. I will explore this a bit next post.