Tuesday, November 4, 2014


Assumptions and purpose:
This is an attempt to compare two promising open source object store technologies purely on performance. The use case kept in mind is a small or large scale public cloud storage provider, and the goal is to evaluate the better technology for that use case.
The feature delta between OpenStack Swift and the Ceph Object Store is ignored here. Ceph is viewed only as an object store serving objects via the Swift REST API (not RADOS objects); Ceph's other interfaces, which provide file and block based access, are ignored.
The assumption is that the two technologies are best compared when deployed on the same hardware and topology and tested with the same kind of workload. Data caching is discouraged while collecting numbers (the page cache, dentries and inodes are flushed every minute on each server). COSBench is used as the benchmarking tool.


I got some suggestions from the Ceph community to improve Ceph-RGW performance. I tried all of them; they do have a minor impact on overall Ceph-RGW performance (<3%), but nothing that changes the overall conclusion of the study.

It would not be an apples-to-apples comparison, but with multiple RGW-CivetWeb instances behind HAProxy I was able to get better results with Ceph-RGW. I will be posting those numbers soon.


There are two flavors of Dell PowerEdge R620 servers used in the study. For simplicity I will call them T1 and T2.

T1:
CPU: 2x Intel E5-2680 10C 2.8GHz, 25MB cache (40 logical CPUs with HT enabled)
RAM: 4x 16GB RDIMM, dual rank x4 (64GB)
NIC1: Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet (for management)
NIC2: Mellanox ConnectX-3, 40 Gigabit Ethernet, dual port full duplex (for data)
Storage: 160 GB HDD (for OS)

T2:
CPU: 2x Intel E5-2680 10C 2.8GHz, 25MB cache (40 logical CPUs with HT enabled)
RAM: 8x 16GB RDIMM, dual rank x4 (128GB)
NIC1: Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet (for management)
NIC2: Mellanox ConnectX-3, 40 Gigabit Ethernet, dual port full duplex (for data)
Storage1: 160 GB HDD (for OS)
Storage2: 10x 400GB Optimus Eco™ 2.5" SAS SSDs (4TB)
Interface: SAS (4 Phy, 6Gb/s)
Interface Ports: Dual/Wide

Network Bandwidth Check:

Host-A$ date ; sudo iperf -c XXX.XXX.XXX.B  -p 5001 -P4 -m ; date
Tue Nov  4 13:16:58 IST 2014
Client connecting to XXX.XXX.XXX.B  , TCP port 5001
TCP window size:  325 KByte (default)
[  5] local XXX.XXX.XXX.A  port 43892 connected with XXX.XXX.XXX.B  port 5001
[  3] local XXX.XXX.XXX.A  port 43891 connected with XXX.XXX.XXX.B  port 5001
[  6] local XXX.XXX.XXX.A  port 43893 connected with XXX.XXX.XXX.B  port 5001
[  4] local XXX.XXX.XXX.A  port 43890 connected with XXX.XXX.XXX.B  port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  10.9 GBytes  9.35 Gbits/sec
[  5] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[  3]  0.0-10.0 sec  9.17 GBytes  7.88 Gbits/sec
[  3] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[  6]  0.0-10.0 sec  16.5 GBytes  14.2 Gbits/sec
[  6] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[  4]  0.0-10.0 sec  8.72 GBytes  7.49 Gbits/sec
[  4] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[SUM]  0.0-10.0 sec  45.3 GBytes  38.9 Gbits/sec
Tue Nov  4 13:17:08 IST 2014
Host-B$ date ; sudo iperf -c XXX.XXX.XXX.A -p 4001 -P4 -m ; date
Tue Nov  4 13:17:01 IST 2014
Client connecting to XXX.XXX.XXX.A, TCP port 4001
TCP window size:  325 KByte (default)
[  4] local XXX.XXX.XXX.B port 59130 connected with XXX.XXX.XXX.A port 4001
[  3] local XXX.XXX.XXX.B port 59131 connected with XXX.XXX.XXX.A port 4001
[  6] local XXX.XXX.XXX.B port 59133 connected with XXX.XXX.XXX.A port 4001
[  5] local XXX.XXX.XXX.B port 59132 connected with XXX.XXX.XXX.A port 4001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  14.6 GBytes  12.6 Gbits/sec
[  4] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[  3]  0.0-10.0 sec  7.90 GBytes  6.79 Gbits/sec
[  3] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[  6]  0.0-10.0 sec  14.7 GBytes  12.7 Gbits/sec
[  6] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[  5]  0.0-10.0 sec  8.40 GBytes  7.21 Gbits/sec
[  5] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
[SUM]  0.0-10.0 sec  45.7 GBytes  39.2 Gbits/sec
Tue Nov  4 13:17:11 IST 2014

So the total available bandwidth is ~39 Gbps (~4.9 GB/s) for inbound traffic and ~39 Gbps (~4.9 GB/s) for outbound traffic as well.
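As a quick sanity check on that figure, the iperf SUM of 38.9 Gbits/sec converts to GBytes/sec like this:

```shell
# Convert the iperf SUM figure from Gbit/s to GByte/s (divide by 8)
awk 'BEGIN { printf "%.2f GB/s\n", 38.9 / 8 }'
# → 4.86 GB/s
```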

Topology & Setup:

The Ceph setup has two more monitor nodes which are not shown here.

Ceph RGW Setup:

OpenStack Swift Setup:

Software Details

General Configuration:

  1. Ubuntu 14.04 (3.13.0-24-generic)
  2. Linux Tuning options for networking configured on all the nodes
#Configs recommended for Mellanox ConnectX-3
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_low_latency = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
kernel.core_uses_pid = 1

An MTU size of 9000 is used along with the above options.
  3. A cron job is configured to flush the DRAM cache every minute on each node.

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
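For reference, a sketch of how these host-level settings could be applied on each node. The sysctl file path, the data interface name eth2 and the cron file location are assumptions, not from the original setup; the flush command itself is the one above.

```shell
# Hypothetical: persist and load the TCP tuning options, then raise the MTU
# on the 40GbE data NIC.
sudo sysctl -p /etc/sysctl.d/90-net-tuning.conf   # file holds the options listed above
sudo ip link set dev eth2 mtu 9000                # 'eth2' stands in for the Mellanox data port

# Hypothetical /etc/cron.d/drop-caches entry -- flush page cache, dentries
# and inodes every minute:
# * * * * * root sync; echo 3 > /proc/sys/vm/drop_caches
```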

Ceph Configurations:-

  1. Ceph Version: 0.87
  2. RGW is used with Apache + FastCGI as well as CivetWeb.
  3. Apache version: 2.4.7-1ubuntu4.1, with libapache2-mod-fastcgi 2.4.7~0910052141-1.1
  4. Ceph.conf is placed here. It contains all the Ceph optimization configuration done in the experiment.
  5. Default region, zone and pools. All .rgw* pools created with the default zone are set to use a PG_NUM of 4096.
  6. Replica count is set to 3 (max_size=3, min_size=2).
  7. Apache configuration parameters:
ServerLimit          4096
ThreadLimit           200
StartServers           20
MinSpareThreads        30
MaxSpareThreads       100
ThreadsPerChild       128
MaxClients           4096
MaxRequestsPerChild  10000  
  8. CivetWeb is used with all default configurations. However, 'rgw_op_thread' seems to control CivetWeb's 'num_op_thread' option, which is set to 128. Increasing this parameter beyond that point seems to degrade response time (I tried setting it to 256/512 and it resulted in more and more HTTP connections stuck in the CLOSE_WAIT state). I am hitting a CivetWeb bug related to this problem.

OpenStack Swift Configurations:

  1. OpenStack Swift Version :  Icehouse/ Swift2.0
  2. Webserver: Default WSGI
  3. All performance optimization is done based on OpenStack Swift deployment guide.
  4. An inode size of 256 bytes is used; all other XFS formatting and mounting options are as per the recommendations in the Swift deployment guide.
  5. WSGI pipeline is trimmed down and only contains essential middleware.
Proxy Server WSGI pipeline looks like this:
pipeline = healthcheck cache tempauth proxy-server   
  6. Each storage node is configured as a zone in a region, and on each node one disk is dedicated to the account and container databases. All other disks are used for keeping objects only. Ring files are populated based on these configurations.
  7. The proxy node runs only the proxy-server and memcached.
  8. Storage nodes run all the other Swift services, i.e. account-server, container-server and object-server, along with supporting services like auditors, updaters and replicators.

COSBench & Workload Details

  1. COSBench Version: 0.4.0.e1  
  2. COSBench Controller and driver both are configured on the same machine, as the hardware is capable of sustaining the workload.
  3. Small File/Object workload is as follows:
Object Size: 1MB
Containers: 100
Objects Per Container: 1000
  4. Large File/Object workload is as follows:
Object Size: 1GB
Containers: 10
Objects Per Container: 100

  5. Objects are written once in both cases.
  6. Every workload is configured to use a different COSBench worker count.
  7. For the small file workload the worker counts are: 32, 64, 128, 256, 512.
  8. For the large file workload the worker counts are: 8, 16, 32, 64, 128.
  9. Each workload is executed for 900 seconds, and objects are read randomly from the available set of Swift objects.
  10. There is no difference between the Ceph and Swift workloads except the value of the generated token. A token was generated after creating Swift users in both cases; this token is provided along with the Storage-URL in the workload configurations.
  11. Ceph puts all the Swift objects in a single Ceph pool called '.rgw.buckets'.
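For reference, the token for the Swift case can be fetched from tempauth like this (the proxy address and port are assumptions; for Ceph-RGW the same request goes to the RGW endpoint configured for Swift auth):

```shell
# Hypothetical: fetch an auth token and storage URL from Swift tempauth.
# The X-Auth-Token and X-Storage-Url response headers are what go into
# the COSBench workload configuration.
curl -i \
  -H "X-Auth-User: test:tester" \
  -H "X-Auth-Key: testing" \
  http://proxy_ip:8080/auth/v1.0
```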


Small File Workload:

Large File Workload:

Additional Details:

90%RT: the response time within which 90% of requests completed
Max RT: the maximum response time across all successful requests

Small Files

Each cell below is 90%RT / Max RT. Worker counts are those listed in the workload details; the three result columns are the three tested setups, in the order the numbers were originally reported.

| Workers | Setup 1             | Setup 2             | Setup 3             |
| 32      | <20 ms / 1,450 ms   | <20 ms / 1,440 ms   | <30 ms / 10,230 ms  |
| 64      | <50 ms / 2,000 ms   | <60 ms / 1,460 ms   | <30 ms / 16,336 ms  |
| 128     | <110 ms / 3,090 ms  | <120 ms / 1,480 ms  | <70 ms / 16,380 ms  |
| 256     | <210 ms / 1,760 ms  | <120 ms / 3,280 ms  | <90 ms / 17,020 ms  |
| 512     | <330 ms / 33,120 ms | <200 ms / 20,040 ms | <160 ms / 16,760 ms |

Large Files:

Each cell below is 90%RT / Max RT; the three result columns are the three tested setups, in the order the numbers were originally reported.

| Workers | Setup 1                | Setup 2                | Setup 3                |
| 8       | <3,110 ms / 4,540 ms   | <3,060 ms / 5,380 ms   | <6,740 ms / 11,210 ms  |
| 16      | <5,550 ms / 7,980 ms   | <5,780 ms / 18,150 ms  | <8,150 ms / 13,710 ms  |
| 32      | <10,860 ms / 11,900 ms | <10,970 ms / 12,120 ms | <9,800 ms / 17,810 ms  |
| 64      | <21,370 ms / 24,200 ms | <21,190 ms / 22,080 ms | <19,530 ms / 38,760 ms |
| 128     | <42,410 ms / 43,340 ms | <41,590 ms / 44,210 ms | <46,800 ms / 74,810 ms |


  1. Native Swift's behaviour and result curves look sane; a clear relationship between concurrency and throughput is established.
  2. Ceph-RGW seems to have a problem with the RGW threading model; a flat throughput curve under increasing concurrency is certainly not a good sign.
  3. Native Swift generally performs better in high concurrency environments.
  4. Ceph RGW gives better bandwidth at lower concurrency.
  5. Ceph RGW response time is excellent for large objects.
  6. For small objects at lower concurrency Ceph-RGW looks very promising, but there is much to do, as concurrency plays a great role in a web server environment.
  7. Ceph RGW's major bottleneck is the web server. CivetWeb and Apache+FastCGI give comparable numbers, though CivetWeb has better response times at high concurrency. CivetWeb has an inherent design limitation, which is already reported here.
  8. Digging further, I also made an attempt to benchmark Ceph using RADOS bench, which uses RADOS objects directly (different from the Swift object interface Ceph provides), so RGW is out of the picture. I ran the bench on the same node used as the COSBench controller + driver. In summary my observations are as below:
| Object Size & Threads | Avg Bandwidth (MB/s) | Avg Latency (s) | Runtime (s) |
| 1M, t=128             |                      |                 |             |
| 1M, t=256             |                      |                 |             |
| 4M, t=128             |                      |                 |             |
Since objects are of a fixed size, the bandwidth numbers map directly to OPS. In summary, even RADOS bench does not deliver bandwidth beyond ~4 GB/s. The strange thing is that it appears optimized for a 4MB object size; increasing the object size beyond that does not give higher OPS (bandwidth).
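For reference, runs like the ones above can be reproduced with something like the following (the pool name 'bench' and the 60-second duration are assumptions):

```shell
# Hypothetical rados bench invocation for the 1MB / 128-thread case.
# Write objects first (--no-cleanup keeps them for the read phase),
# then read them back randomly, then remove the benchmark objects.
rados bench -p bench 60 write -b 1048576 -t 128 --no-cleanup
rados bench -p bench 60 rand -t 128
rados -p bench cleanup
```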

Other Remarks:
  1. Swift is more feature rich in terms of its REST API.
  2. The S3 API is supported by both.
  3. Finding good documentation is a big pain when setting up Ceph.

Friday, October 11, 2013

This document describes how a gluster-swift (G4S) setup can be authenticated against Keystone. It assumes you have two F18 VMs, and all commands are executed as the root user. From here on these two VMs will be referred to as 'kshost' and 'g4snode'.

A. Install and configure keystone on kshost

1.Get the repo for RDO Grizzly
#yum install -y http://rdo.fedorapeople.org/openstack-grizzly/rdo-release-grizzly.rpm

2.Install keystone and related packages
#yum install openstack-utils openstack-keystone python-keystoneclient

3.Delete the keystone.db file created in the /var/lib/keystone directory
#rm /var/lib/keystone/keystone.db

4.The following sequence of commands will create a MySQL database named "keystone" and a MySQL user named "keystone" with full access to the "keystone" MySQL database.

  a.) #openstack-db --init --service keystone
  (this may ask you to install MySql-server,and mysql password.)

  b.)Login in to mysql server and configure keystone db :-
  #mysql -u root -p
  Enter the mysql root user's password when prompted.
  c.)Create a MySQL user for the newly-created keystone database that has
  full control of the keystone database.
mysql> GRANT ALL ON keystone.* TO 'keystone'@'%' IDENTIFIED BY '[KEYSTONEDB_PASSWORD]';
mysql> GRANT ALL ON keystone.* TO 'keystone'@'localhost' IDENTIFIED BY '[KEYSTONEDB_PASSWORD]';
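A quick way to confirm the grants took effect (substitute the real password for the placeholder, as above):

```shell
# Hypothetical check: confirm the new 'keystone' user can reach the database
mysql -u keystone -p'[KEYSTONEDB_PASSWORD]' -h localhost keystone -e 'SELECT 1;'
```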

5. To change the data store to MySQL, change the line defining 'connection' in /etc/keystone/keystone.conf like:
  connection = mysql://keystone:[YOUR_KEYSTONEDB_PASSWORD]@kshost_ip/keystone

Your /etc/keystone/keystone.conf file should also contain the following line (under the [catalog] section) if it is properly configured to use the database backend:-

driver = keystone.catalog.backends.sql.Catalog

6.Change the admin token to a generated one
  #export ADMIN_TOKEN=$(openssl rand -hex 10)
Assume the token generated above is '012345SECRET99TOKEN012345' and edit /etc/keystone/keystone.conf as follows:-
  admin_token = 012345SECRET99TOKEN012345

7.By default Keystone will use PKI tokens. To create the signing keys and certificates run:
  #keystone-manage pki_setup
  #chown -R keystone:keystone /etc/keystone/* /var/log/keystone/keystone.log

Note: You can change the pki configs in /etc/keystone/ssl/certs/openssl.conf

8.You can configure keystone to use /var/log/keystone/keystone.log and set the log level to 'DEBUG', by setting appropriate parameter in /etc/keystone/keystone.conf

9.Start keystone service
  #service openstack-keystone start && sudo chkconfig openstack-keystone on

10.Lastly, initialize the new keystone database, as root:
  #keystone-manage db_sync

11.Typically, you would use a username and password to authenticate with the Identity service. However, at this point in the install, we have not yet created a user. Instead, we use the service token to authenticate against the Identity service.

#export OS_SERVICE_TOKEN=012345SECRET99TOKEN012345

12.Now we will create a tenant and users. Think of a tenant as a Swift account (in G4S a Swift account maps to a gluster volume). Assume you are creating an account named 'test'. Note the tenant id; a gluster volume has to be mounted on g4snode under this id (see section C).

#keystone tenant-create --name test --description "an account/volume for G4S"

      |   Property  |              Value               |
      | description | an account/volume for G4S        |
      | enabled     | True                             |
      | id          | b5815b046cfe47bb891a7b64119e7f80 |
      | name        | test                             |

13.Create users for this tenant(account)

#keystone user-create --tenant-id b5815b046cfe47bb891a7b64119e7f80 --name tester --pass testing

      | Property |              Value               |
      | email    |                                  |
      | enabled  | True                             |
      | id       | a4c2d43f80a549a19864c89d759bb3fe |
      | name     | tester                           |
      | tenantId | b5815b046cfe47bb891a7b64119e7f80 |

14.Create an administrative role, 'admin', based on keystone's default policy.json file.
    (you can think of it as group in tempauth terminology)

# keystone role-create --name admin
      | Property |              Value               |
      | id       | e3d9d157cc95410ea45d23bbbc2e5c10 |
      | name     | admin                            |

15.Grant the admin role to the  ‘tester’ user in the ‘test’ tenant with "user-role-add".

#keystone user-role-add --user-id a4c2d43f80a549a19864c89d759bb3fe --tenant-id b5815b046cfe47bb891a7b64119e7f80 --role-id e3d9d157cc95410ea45d23bbbc2e5c10
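The assignment can be double-checked with the same generation of the keystone CLI (a hedged check, using the ids from the steps above):

```shell
# Hypothetical check: confirm the 'admin' role is now attached to 'tester'
# in the 'test' tenant
keystone user-role-list \
  --user-id a4c2d43f80a549a19864c89d759bb3fe \
  --tenant-id b5815b046cfe47bb891a7b64119e7f80
```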

16.Create service and endpoint for keystone.
  #keystone service-create --name=keystone --type=identity --description="Identity Service"
|   Property  |              Value               |
| description | Identity Service                 |
| id          | 15c11a23667e427e91bc31335b45f4bd |
| name        | keystone                         |
| type        | identity                         |

  #keystone endpoint-create \
  --region RegionOne \
  --service-id=15c11a23667e427e91bc31335b45f4bd \
  --publicurl=http://kshost_ip:5000/v2.0 \
  --internalurl=http://kshost_ip:5000/v2.0 \
  --adminurl=http://kshost_ip:35357/v2.0
|   Property  |             Value                 |
| adminurl    | http://kshost_ip:35357/v2.0 |
| id          | 11f9c625a3b94a3f8e66bf4e5de2679f  |
| internalurl | http://kshost_ip:5000/v2.0  |
| publicurl   | http://kshost_ip:5000/v2.0  |
| region      | RegionOne                         |
| service_id  | 15c11a23667e427e91bc31335b45f4bd  |

17.Create a service and endpoints for G4S. You can think of the G4S service endpoint as the base URL for each account.

#keystone service-create --name=gluster-swift --type=object-store --description="G4S Object Storage Service"
|   Property  |              Value               |
| description | G4S Object Storage Service       |
| id          | 272efad2d1234376cbb911c1e5a5a6ed |
| name        | gluster-swift                    |
| type        | object-store                     |

#keystone endpoint-create \
--region RegionOne \
--service-id=272efad2d1234376cbb911c1e5a5a6ed \
--publicurl 'http://g4snode_ip:8888/v1/AUTH_%(tenant_id)s' \
--internalurl 'http://g4snode_ip:8888/v1/AUTH_%(tenant_id)s' \
--adminurl 'http://g4snode_ip:8888/v1'

|   Property  |                       Value                       |
| adminurl    | http://g4snode_ip:8888/v1                    |
| id          | e32b3c4780e51332f9c128a8c208a5a4                  |
| internalurl | http://g4snode_ip:8888/v1/AUTH_%(tenant_id)s |
| publicurl   | http://g4snode_ip:8888/v1/AUTH_%(tenant_id)s |
| region      | RegionOne                                         |
| service_id  | 272efad2d1234376cbb911c1e5a5a6ed                  |

B.Prepare your F18 VM for a G4S node

1.Install glusterfs
  #yum install glusterfs glusterfs-server

2.Get the repo for RDO Grizzly
  #yum install -y http://rdo.fedorapeople.org/openstack-grizzly/rdo-release-grizzly.rpm

3.Install the gluster-swift plugin (it is not available in the official F18 repo yet, but will be soon; for now download the latest rpm from build.gluster.org).

The right RPMs will be available here:-

After downloading, install it with:
  #yum install glusterfs-openstack-swift-1.8.0-7.2.fc19.noarch.rpm

4.Verify all dependencies got installed:-
  #rpm -qa | grep openstack


5.Get your config files correct:-

#cd /etc/swift

#rm -rf account-server container-server object-server proxy-server account-server.conf container-server.conf object-server.conf swift.conf proxy-server.conf

#mv account-server.conf-gluster account-server.conf

#mv container-server.conf-gluster container-server.conf

#mv object-server.conf-gluster object-server.conf

#mv proxy-server.conf-gluster proxy-server.conf

#mv fs.conf-gluster fs.conf

#mv swift.conf-gluster swift.conf

You can check and modify these config files according to your setup; otherwise they are good for an all-in-one kind of setup.

6.Get a mock gluster volume(you can skip this if you already have one)

#dd if=/dev/zero of=~/myFileSystem.img bs=1024 count=1048576

#mkfs.xfs -f -n size=8192 -d su=256k,sw=10 ~/myFileSystem.img

#mkdir -p /mnt/gbrick1

#mount -o loop ~/myFileSystem.img /mnt/gbrick1/

#IP=`ip addr show eth0 |grep 'inet ' | awk '{print $2}'| cut -d '/' -f1`

#gluster volume create test $IP:/mnt/gbrick1/

#gluster volume start test
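You can verify the mock volume came up with the standard gluster commands (output format varies by glusterfs version):

```shell
# Check that the 'test' volume exists and is started
gluster volume info test
gluster volume status test
```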

C.Config on G4S-node to work with keystone

1.Assuming you have a gluster volume named 'test', you need to mount it at /mnt/gluster-object/tenant-id. Use the tenant id generated in section A-12.

#mkdir -p /mnt/gluster-object/b5815b046cfe47bb891a7b64119e7f80

#mount -t glusterfs IP_WHERE_GLUSTER_VOL_HOSTED:b5815b046cfe47bb891a7b64119e7f80 /mnt/gluster-object/b5815b046cfe47bb891a7b64119e7f80
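A quick, hedged sanity check that the mount is in place before going further:

```shell
# Confirm the volume is mounted at the tenant-id path
mount | grep gluster-object
df -h /mnt/gluster-object/b5815b046cfe47bb891a7b64119e7f80
```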

2.Config the /etc/swift/proxy-server.conf for keystone:-

  a.)Modify the pipeline as follows:-

pipeline = catch_errors healthcheck proxy-logging cache authtoken keystoneauth proxy-logging proxy-server 

  b.)Add an authtoken filter with the details of your auth setup:-

     [filter:authtoken]
     paste.filter_factory = keystoneclient.middleware.auth_token:filter_factory
     signing_dir = /etc/swift/signing_dir
     auth_host = kshost_ip
     auth_port = 35357
     auth_protocol = http
     service_host = g4snode_ip
     service_port = 8080
     admin_token = 012345SECRET99TOKEN012345

  c.)Config the keystoneauth filter:-

     [filter:keystoneauth]
     use = egg:swift#keystoneauth
     operator_roles = admin
     is_admin = true
     cache = swift.cache

3.Creating ring files for the mounted volume:-

#gluster-swift-gen-builders b5815b046cfe47bb891a7b64119e7f80

(b5815b046cfe47bb891a7b64119e7f80 is the tenant ID)

4.Start all the swift services on G4S node

#swift-init main start

5.You are all set for testing; you can upload a file (install.log) to a new container called 'dir'.

#swift -V 2.0 -A http://kshost:5000/v2.0 -U test:tester -K testing upload dir install.log
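To verify the upload worked, a few hedged checks (the paths reuse the tenant id from section A-12):

```shell
# List the container and stat the account via the same keystone credentials
swift -V 2.0 -A http://kshost:5000/v2.0 -U test:tester -K testing list dir
swift -V 2.0 -A http://kshost:5000/v2.0 -U test:tester -K testing stat
# The object should also be visible directly on the gluster mount:
ls /mnt/gluster-object/b5815b046cfe47bb891a7b64119e7f80/dir/
```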