Surfing Linux, Big Data and Data Science

Tuesday, December 11, 2018

Creating Ubuntu Vagrant Boxes for CentOS & Fedora (libvirt)

If you need Ubuntu Vagrant boxes but are working on a Red Hat-based Linux distribution (RHEL, CentOS, Fedora etc.) then you are faced with a dilemma. All of the official Ubuntu Vagrant boxes available in the Vagrant Cloud are in VirtualBox format, but this format has no built-in support on Red Hat-based systems. Yes, you could install VirtualBox directly from the Oracle website, but this is a one-off install that doesn't give you automatic updates. In addition, Red Hat has an excellent (the best?) hypervisor in the form of KVM, which is available from the official repositories on CentOS, Fedora etc.

Security Aside: The Vagrant Cloud website has many Ubuntu boxes, some with convenient applications already installed so you don't have to waste time doing this yourself. But think twice before you grab one of those: do you know who created it? do you trust them? are they really who they claim to be? what else did they install on that box? As convenient as these pre-prepared boxes are, they are an easy way to sneak nasty things onto your network. So to be safe, only use the official boxes from Canonical, CentOS etc.

OK, back to the task at hand: how do you convert an Ubuntu Vagrant box (designed for the virtualbox provider) to a box that can be used on RHEL/CentOS/Fedora (with the KVM/libvirt provider)? The basic process is:

Get the Vagrant box file
Unbundle this Vagrant box file
Convert the disk image to a format libvirt can understand
Create config files for your new box
Bundle together your new config files and disk image into a new box
Add to Vagrant

Download the box

The first hurdle in the road is simply finding what you want to download! Even though you can see all these boxes on the Vagrant Cloud site, they are not actually directly downloadable from Vagrant. However, on the more detailed page for each box, like this one for Ubuntu Bionic 18.04, you will see next to each provider-specific box that these boxes are actually hosted on an Ubuntu site. Going there you can navigate to "daily vagrant builds" and here you will find the most up-to-date Vagrant box of Ubuntu 18.04.
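In the spirit of the security aside above, it is worth checking a downloaded box against a published checksum before using it. The sketch below assumes the download directory offers a SHA256SUMS file next to the box (an assumption, check the site you download from); demo.box and the locally generated SHA256SUMS are stand-ins so that the example runs without any network access.

```shell
# Stand-in for a downloaded box; a real run would use the file from the Ubuntu site.
workdir=$(mktemp -d)
printf 'not a real box\n' > "$workdir/demo.box"

# Stand-in for the publisher's checksum list (format: "<sha256>  <filename>").
( cd "$workdir" && sha256sum demo.box > SHA256SUMS )

# Verify the download; --ignore-missing skips list entries for files
# that were not downloaded.
if ( cd "$workdir" && sha256sum --check --ignore-missing SHA256SUMS ); then
    verify_status=ok
else
    verify_status=failed
fi
echo "checksum verification: $verify_status"
```

If the file has been altered or truncated, sha256sum reports a mismatch and exits with a non-zero status.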

Unbundle the box

The Vagrant box file is nothing but a *.tar file (but has a .box extension). Looking at the contents you will see one very large file (the actual disk image) and some small metadata and config files.

$ tar tvf bionic-server-cloudimg-amd64-vagrant.box
-rw-r--r-- root/root 11061 2018-12-05 09:08 box.ovf
-rw-r--r-- root/root 478 2018-12-05 09:08 Vagrantfile
-rw-r--r-- root/root 31 2018-12-05 09:08 metadata.json
-rw-r--r-- root/root 310 2018-12-05 09:08 ubuntu-bionic-18.04-cloudimg.mf
-rw-r--r-- root/root 316687872 2018-12-05 09:08 ubuntu-bionic-18.04-cloudimg.vmdk
-rw-r--r-- root/root 72192 2018-12-05 09:08 ubuntu-bionic-18.04-cloudimg-configdrive.vmdk

Now we can extract this box with

$ tar xvf bionic-server-cloudimg-amd64-vagrant.box
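Because a box is just a tar archive, you can also read a single member, such as metadata.json, without unpacking the large disk image. This is a minimal sketch that builds a tiny stand-in box (demo.box, with assumed metadata content) so it can run without the real download.

```shell
# Build a tiny stand-in box so the example runs without the real download.
workdir=$(mktemp -d)
printf '{"provider": "virtualbox"}\n' > "$workdir/metadata.json"
printf 'placeholder\n' > "$workdir/Vagrantfile"
tar -cf "$workdir/demo.box" -C "$workdir" metadata.json Vagrantfile

# -O writes the extracted member to stdout instead of to disk.
box_metadata=$(tar -xOf "$workdir/demo.box" metadata.json)
echo "$box_metadata"
```

Run against the real box file, the same tar -xOf call shows which provider the box was built for.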

Convert the disk image

We now need to convert the disk image to *.qcow2 format, which is native to KVM. To do this we will use a great little tool from the QEMU suite, qemu-img. The command line options are pretty straightforward: you need to tell it what input format to expect, what output format to produce and the names of the input and output files.

$ qemu-img convert -f vmdk -O qcow2 ubuntu-bionic-18.04-cloudimg.vmdk box.img

$ qemu-img info box.img
image: box.img
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 1.0G
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

Create config files

According to the docs of the vagrant-libvirt provider, the box tarball consists of only three items:

A *.qcow2 image that is named box.img
A metadata.json file describing the box image. This file must have the following three pieces of information

Provider (libvirt in our case)
Disk image format (qcow2 in our case)
Virtual_size (the size in GB to which the qcow2 image can grow; this is not the actual size of the file created in the disk conversion step above)

A Vagrantfile that defines default settings for the libvirt provider (all of these defaults can be overridden by a project-specific Vagrantfile)

Modify the metadata.json file so that it looks like this (customize the disk size if you wish)

$ cat metadata.json
{
  "provider": "libvirt",
  "format": "qcow2",
  "virtual_size": 16
}
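If you would rather start from the image's real virtual size than pick a number, the byte count reported by qemu-img info can be rounded up to whole GB and written into the file. This is only a sketch: bytes_to_gib and write_metadata are hypothetical helper names, and the byte count is the one shown in the qemu-img info output earlier.

```shell
# Round a byte count up to whole GiB (1 GiB = 1024^3 bytes).
bytes_to_gib() {
    echo $(( ($1 + 1073741823) / 1073741824 ))
}

# Write a metadata.json with the three required fields.
# $1 = virtual size in GB, $2 = output path
write_metadata() {
    printf '{\n  "provider": "libvirt",\n  "format": "qcow2",\n  "virtual_size": %d\n}\n' "$1" > "$2"
}

image_bytes=10737418240          # "virtual size" in bytes from qemu-img info
min_size_gb=$(bytes_to_gib "$image_bytes")
outfile=$(mktemp)                # stand-in path; a real run would write ./metadata.json
write_metadata "$min_size_gb" "$outfile"
cat "$outfile"
```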

Modify the Vagrantfile to be as follows

$ cat Vagrantfile
Vagrant.configure("2") do |config|
  config.vm.provider :libvirt do |libvirt|
    libvirt.driver = "kvm"
    libvirt.host = ""
    libvirt.connect_via_ssh = false
    libvirt.storage_pool_name = "default"
  end
end

Create new box

The converted image must be named box.img (the conversion command above already writes it under that name; if you chose a different output name, rename it now). Then tar up the three parts into a new box called ubuntu-1804.box

$ tar cvzf ubuntu-1804.box -S --totals ./metadata.json ./Vagrantfile ./box.img
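Before handing the new box to Vagrant, a quick sanity check that all three parts made it into the archive can save a confusing error later. The check_box function below is a hypothetical helper, not part of Vagrant, and the demo uses empty placeholder files rather than a real disk image.

```shell
# Return 0 if the archive holds all three required members, 1 otherwise.
check_box() {
    # List members and strip any leading "./" so both naming styles match.
    members=$(tar -tzf "$1" | sed 's|^\./||')
    for required in metadata.json Vagrantfile box.img; do
        if ! printf '%s\n' "$members" | grep -qxF "$required"; then
            echo "missing from $1: $required" >&2
            return 1
        fi
    done
    return 0
}

# Demo with placeholder files standing in for the real ones.
workdir=$(mktemp -d)
( cd "$workdir" &&
  touch metadata.json Vagrantfile box.img &&
  tar -czf good.box ./metadata.json ./Vagrantfile ./box.img &&
  tar -czf bad.box ./metadata.json ./Vagrantfile )

check_box "$workdir/good.box" && echo "good.box looks complete"
check_box "$workdir/bad.box" || echo "bad.box is incomplete"
```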

Add new box to Vagrant

When you add the box to Vagrant, it will unbundle it again and store the disk image, config and metadata files in the correct locations (the default location is ~/.vagrant.d/)

$ vagrant box add ubuntu-1804.box --name ubuntu-1804 --provider libvirt
==> box: Box file was not detected as metadata. Adding it directly...
==> box: Adding box 'ubuntu-1804' (v0) for provider: libvirt
box: Unpacking necessary files from: file:///home/brett/tmp/ubuntu-1804.box
==> box: Successfully added box 'ubuntu-1804' (v0) for 'libvirt'!

You can see all the boxes installed (along with the provider and box version number) in Vagrant by doing

$ vagrant box list
centos/7 (libvirt, 1809.01)
ubuntu-1804 (libvirt, 0)

How to use this new Ubuntu-under-libvirt box

Now just create a simple Vagrantfile and point it to your new Ubuntu/KVM box

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu-1804"

  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 1024
  end
end

And now all you have to do is vagrant up and you'll be running an Ubuntu Vagrant VM on CentOS!

Friday, September 12, 2008

Setting Up TPC-H: Part 2

Now that we have the tools to create the data, let's create a place in the database to put it and then insert the data.

Step 1: Create TPCH Tablespace and user/schema
Create both a dedicated tablespace and schema to contain/access this data - here it is simply called TPCH. As the SYS user

SQL> CREATE SMALLFILE TABLESPACE "TPCH" DATAFILE '/u01/app/oracle/oradata/BRS01/tpch.dbf' SIZE 1000M AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

SQL> CREATE USER "TPCH" PROFILE "DEFAULT" IDENTIFIED BY "password" DEFAULT TABLESPACE "TPCH" TEMPORARY TABLESPACE "TEMP" QUOTA UNLIMITED ON "TPCH" ACCOUNT UNLOCK;
SQL> GRANT "CONNECT" TO "TPCH";
SQL> GRANT CREATE TABLE TO "TPCH";
SQL> GRANT CREATE VIEW TO "TPCH";

Step 2: Create the tables
TPCH comes with two files (dss.ddl and dss.ri) that contain the DDL and referential integrity setup. However, since we will use the "direct path" option of SQL*Loader (sqlldr) to put the data into the database, it doesn't make sense to have any primary or foreign keys in place when loading. Run this script (as the tpch user) to create all the required tables.

Step 3: Generate and load data into database
Jeff Moss has put together a "wrapper" script that uses dbgen to create and store the data in flat files and then calls Oracle's sqlldr to put the data into the database - see details here. Here is what needs to be done

Download the control files (*.ctl) and the two scripts and put them in the same directory as tpch, i.e. where dbgen and qgen are located. Due to the wiki tool used on Jeff's page the naming is a bit mangled - just rename using lower case and use a .ctl or .sh extension.
Run the scripts, following Jeff's examples almost verbatim. Obviously, use a connection string appropriate for your own database and pay attention to the last parameter - it is the total number of (parent + child) processes created, i.e. the number of parallel streams used to create and load the data. Too high a number here can very easily bring a system to its knees - my rule of thumb is to make this equal to the number of CPU cores. The first parameter is the TPCH Scale Factor: 1 ~ 1 GB database, 10 ~ 10 GB database etc.
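To apply the one-stream-per-core rule of thumb without hard-coding a number, the core count can be read from the system. This is a small sketch: the variable names are illustrative, and the wrapper script call itself is left out because its exact name and arguments come from Jeff's page.

```shell
# Number of online CPU cores (_NPROCESSORS_ONLN is widely supported on Linux).
cores=$(getconf _NPROCESSORS_ONLN)

scale_factor=1      # first parameter: 1 ~ 1 GB database
streams=$cores      # last parameter: one parallel stream per core

echo "scale factor: $scale_factor, parallel streams: $streams"
```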

Step 4: Create primary keys, foreign keys and indexes
These constraints are specified in the dss.ri file of TPCH. Unfortunately, some syntactic idiosyncrasies and outdated schema names mean that this is not simply plug 'n play on Oracle. To make life easier I created an Oracle-compatible script that will set up all the primary and foreign keys - the script is here.

Setting Up TPC-H: Part 1

TPC-H is the data warehouse benchmark of the Transaction Processing Performance Council (their web site has lots of results submitted by vendors trying to display the prowess of their hardware and/or software). As with all benchmarks, TPC-H is not perfect: it's not even a star schema, so it doesn't really represent 99% of real data warehouses, and comparing results between systems, groups, companies etc. is fraught with difficulty and complication. But we're gonna do it anyways! Just remember all the usual benchmark caveats.

Download and untar the files from the TPC-H web site, then copy makefile.suite to makefile and edit the four lines that specify the compiler on your system (CC), database, machine and workload

CC = gcc
DATABASE= ORACLE
MACHINE = LINUX
WORKLOAD = TPCH
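If you prefer to script the copy-and-edit step, something like the following should work. It is a sketch under an assumption: the here-document is a stand-in for the real makefile.suite, in which these four variables are taken to be present but unset, so compare it with your copy before relying on the patterns.

```shell
workdir=$(mktemp -d)

# Stand-in for the real makefile.suite: only the four unset variables.
cat > "$workdir/makefile.suite" <<'EOF'
CC      =
DATABASE=
MACHINE =
WORKLOAD =
EOF

# Copy to "makefile" while filling in the four settings.
sed -E \
    -e 's/^CC[[:space:]]*=.*/CC = gcc/' \
    -e 's/^DATABASE[[:space:]]*=.*/DATABASE= ORACLE/' \
    -e 's/^MACHINE[[:space:]]*=.*/MACHINE = LINUX/' \
    -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD = TPCH/' \
    "$workdir/makefile.suite" > "$workdir/makefile"

# Show the result.
grep -E '^(CC|DATABASE|MACHINE|WORKLOAD)[[:space:]]*=' "$workdir/makefile"
```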

When setting the database you'll notice that there is no predefined type for Oracle. Huh? The company that has 45% of the RDBMS market is not listed here? Either Oracle requested this or the TPC guys are extremely biased in favor of IBM or Microsoft (their web site does use ASP.NET :-) Because of this we need to define an Oracle section ourselves. Edit tpcd.h and add a section for Oracle (with all variables defined as empty strings, this is the simplest setup that works)

#ifdef ORACLE
#define GEN_QUERY_PLAN ""
#define START_TRAN ""
#define END_TRAN ""
#define SET_OUTPUT ""
#define SET_ROWCOUNT ""
#define SET_DBASE ""
#endif /* ORACLE */

Then just type make to compile. This will generate two executables, dbgen and qgen, which are used, respectively, to generate the flat files for loading into the database and the queries to run. See the README for the gory details.