Compare Triple Loading

Comparing RDF APIs

From Python I have several possible APIs for storing and querying RDF data. I've been using Redland, as its API was reasonably stable and I could easily run it on my servers. However, when I started bulk loading a 30,000-triple dataset into a hash database I noticed it taking an annoyingly long time, and thought I should see if there was something I could do to improve things.

Redland provides a C API and a set of command line tools for working with RDF data.

RDFLib is a pure Python library. The 3.0 series seemed too hard to install; however, the 4.0 series installed in a usable configuration with a quick pip install rdflib.

Soprano is a KDE-specific technology, so it is unlikely to be useful to most others. The nice thing about Soprano is that it appears to be able to start OpenLink Virtuoso as the current user. I'm using the PyKDE4 wrappers to access it.

These tests were run on a ThinkPad X220 while I was also doing other light work at the same time, so you should only pay attention to the shapes of the curves. The important question really is the order of growth of inserting new triples.

Table of Contents

In [1]:
import os
import json
import shutil
import sys
import tempfile
import time
import uuid

%pylab inline
Populating the interactive namespace from numpy and matplotlib

Loading Data

For my real project, I'm grabbing some JSON from a web app and using PyLD to convert it into a JSON serialization of RDF. However, for this benchmarking that's a bit of a distraction, so to simplify this notebook and to save on their bandwidth I'm just going to load the dataset from a file.

In [2]:
with open(os.getcwd() + '/experiments.json', 'r') as instream:
    data = instream.read()
    experiments = json.loads(data)
In [3]:
for k in experiments:
    print '{}: {}'.format(k, len(experiments[k]))
http://submit.encodedcc.org/experiments/: 32509
@default: 20

A few records from the dataset, so you can see what it looks like:

In [4]:
experiments['http://submit.encodedcc.org/experiments/'][:5]
Out[4]:
[{u'object': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/profiles/experiment.json#experiment'},
  u'predicate': {u'type': u'IRI',
   u'value': u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'},
  u'subject': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/experiments/ENCSR000AAA/'}},
 {u'object': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/profiles/experiment.json#item'},
  u'predicate': {u'type': u'IRI',
   u'value': u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'},
  u'subject': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/experiments/ENCSR000AAA/'}},
 {u'object': {u'datatype': u'http://www.w3.org/2001/XMLSchema#string',
   u'type': u'literal',
   u'value': u'ENCSR000AAA'},
  u'predicate': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/profiles/experiment.json#accession'},
  u'subject': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/experiments/ENCSR000AAA/'}},
 {u'object': {u'datatype': u'http://www.w3.org/2001/XMLSchema#string',
   u'type': u'literal',
   u'value': u'RNA-seq'},
  u'predicate': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/profiles/experiment.json#assay_term_name'},
  u'subject': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/experiments/ENCSR000AAA/'}},
 {u'object': {u'datatype': u'http://www.w3.org/2001/XMLSchema#string',
   u'type': u'literal',
   u'value': u'ENCODE2'},
  u'predicate': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/profiles/experiment.json#award.rfa'},
  u'subject': {u'type': u'IRI',
   u'value': u'http://submit.encodedcc.org/experiments/ENCSR000AAA/'}}]
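Each of these node dicts can be rendered as an N-Triples term with plain string handling; here's a minimal stdlib-only sketch (the helper name `toNTriplesTerm` is mine, not part of PyLD):

```python
def toNTriplesTerm(node):
    """Render a PyLD-style node dict as an N-Triples term string."""
    value = node['value']
    if node['type'] == 'IRI':
        # IRIs are wrapped in angle brackets
        return '<{}>'.format(value)
    elif node['type'] == 'blank node':
        # blank nodes use the _: prefix
        return '_:{}'.format(value)
    else:
        # literals are quoted, with an optional datatype annotation
        term = '"{}"'.format(value)
        if node.get('datatype'):
            term += '^^<{}>'.format(node['datatype'])
        return term

print(toNTriplesTerm({'type': 'literal',
                      'value': 'ENCSR000AAA',
                      'datatype': 'http://www.w3.org/2001/XMLSchema#string'}))
# → "ENCSR000AAA"^^<http://www.w3.org/2001/XMLSchema#string>
```

The API-specific `toNode` functions later in this notebook do essentially the same dispatch, just targeting each library's node classes instead of strings.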

Top

Timing Cache

Since some of these loaders are slow, I save the timings between runs of this notebook.

In [5]:
import shelve
timing = shelve.open('rdf-timing')
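As an aside on the cache: `shelve` only guarantees entries are on disk after `sync()` or `close()`, so the round trip between notebook runs looks roughly like this (a stdlib-only sketch using a throwaway file, not the cache above):

```python
import os
import shelve
import tempfile

storage_dir = tempfile.mkdtemp(prefix='shelve_demo_')
path = os.path.join(storage_dir, 'rdf-timing-demo')

# store a timing curve and flush it to disk
db = shelve.open(path)
db['demo'] = [[10, 20], [0.1, 0.25]]
db.close()

# a later run can pick the cached values back up
db = shelve.open(path)
print(db['demo'])  # → [[10, 20], [0.1, 0.25]]
db.close()
```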

Timing Triple Loader

This is a generic loader; it records a timestamp every 1% of the way through, and prints a progress report every 10%.

By passing in functions for the data-conversion, statement-building, and model-adding steps I can reuse it for all the triple stores, and even for basic Python operations.

It returns a pair of parallel arrays: the loaded index and the elapsed time from the base $t_{zero}$ time.

In [6]:
def loadModel(graph, toNode, toStatement, Add, Commit=None):
    """Load a graph into a model, recording progress timestamps.
    
    toNode - converts a JSON node dict into an API-specific node
    toStatement - builds a statement from subject, predicate, object nodes
    Add - the model's add function, e.g. model.add_statement
    Commit - some APIs support a post-loading 'save' function that delays
             building an index until it is run.
    """
    tzero = time.time()
    time_previous = tzero
    time_now = tzero
    index_previous = 0
    l = len(graph)
    onepercent = int(.01 * l)
    tenpercent = int(.1 * l)
    timestamps = [[] , []]
    print 'length: {}'.format(l)
    for i, triple in enumerate(graph):
        s = toNode(triple['subject'])
        p = toNode(triple['predicate'])
        o = toNode(triple['object'])
        Add(toStatement(s, p, o))
        if i > 1 and  i % onepercent == 0:
            time_now = time.time()
            if i > 1 and i % tenpercent == 0:
                print '{} of {} {}. {} {}'.format(i, l, time_now-tzero, i-index_previous, time_now-time_previous)
            timestamps[0].append(i)
            timestamps[1].append(time_now-tzero)
            time_previous = time_now
            index_previous = i
    if Commit:
        Commit()
        timestamps[0].append(i+1)
        timestamps[1].append(time.time() - tzero)
        
    return timestamps

To illustrate using it, here's the simplest case, using Python functions that most closely map to the triple-loader functionality.

In [7]:
def PythonLoader():
    """This uses python functions for the timing loader.
    """
    model = []
    timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                        str,
                        lambda x,y,z: (x,y,z),
                        model.append)
    return timing
                                        
timing['pure-python'] = PythonLoader()
length: 32509
3250 of 32509 0.0269148349762. 325 0.00243592262268
6500 of 32509 0.0505599975586. 325 0.00211310386658
9750 of 32509 0.0820269584656. 325 0.00240397453308
13000 of 32509 0.102426052094. 325 0.00202918052673
16250 of 32509 0.12731385231. 325 0.00392389297485
19500 of 32509 0.151532888412. 325 0.00185894966125
22750 of 32509 0.172271966934. 325 0.00218796730042
26000 of 32509 0.196752071381. 325 0.00181913375854
29250 of 32509 0.222074985504. 325 0.00187706947327
32500 of 32509 0.240712881088. 325 0.00181102752686

Top

API Specific Code

Redland

In [8]:
import RDF
In [9]:
def pyldToRedlandNode(item):
    """JSON-node to Redland RDF Node
    """
    nodetype = item['type']
    value = item['value']
    datatype = item.get('datatype', None)

    if nodetype == 'blank node':
        return RDF.Node(blank=value)
    elif nodetype == 'IRI':
        return RDF.Node(uri_string=str(value))
    else:
        return RDF.Node(literal=unicode(value).encode('utf-8'), 
                        datatype=RDF.Uri(datatype))
In [10]:
RDFS = RDF.NS("http://www.w3.org/2000/01/rdf-schema#")
s = pyldToRedlandNode({'type': 'blank node', 'value': ''})
p = pyldToRedlandNode({'type': 'IRI', 'value': RDFS['label']})
o = pyldToRedlandNode({'type': 'literal', 'value': 7, 'datatype': "http://www.w3.org/2001/XMLSchema#integer"})

assert s.is_blank()
assert p.is_resource()
assert o.is_literal()
assert o.literal_value['string'] == '7'

Top

Redland Timing Tests

Redland Null

In [11]:
def testRedlandPython():
    """This uses python functions for the timing loader.
    
    This gives a minimal baseline of how long the type conversion takes.
    """
    model = []
    timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                        pyldToRedlandNode,
                        lambda x,y,z: (x,y,z),
                        model.append)
    return timing
                                        
timing['redland-python'] = testRedlandPython()
length: 32509
3250 of 32509 0.119760036469. 325 0.046886920929
6500 of 32509 0.198075056076. 325 0.00653505325317
9750 of 32509 0.266803979874. 325 0.00576901435852
13000 of 32509 0.328070163727. 325 0.00610303878784
16250 of 32509 0.389616012573. 325 0.00576281547546
19500 of 32509 0.50829410553. 325 0.00659990310669
22750 of 32509 0.571217060089. 325 0.00665307044983
26000 of 32509 0.632225036621. 325 0.00585603713989
29250 of 32509 0.69336104393. 325 0.00584888458252
32500 of 32509 0.808782100677. 325 0.00580906867981

Top

Redland Memory

In [12]:
def testRedlandMemory():
    """Constructs an in-memory Redland RDF model
    """
    storage = RDF.MemoryStorage()
    model = RDF.Model(storage)
    timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                        pyldToRedlandNode,
                        RDF.Statement,
                        model.add_statement)
    return timing
                                        
timing['redland-memory'] = testRedlandMemory()        
length: 32509
3250 of 32509 0.263481855392. 325 0.0374739170074
6500 of 32509 0.802514076233. 325 0.0692582130432
9750 of 32509 1.75570607185. 325 0.125497102737
13000 of 32509 3.68864703178. 325 0.249485969543
16250 of 32509 6.57915091515. 325 0.317499876022
19500 of 32509 10.2026870251. 325 0.398415088654
22750 of 32509 14.6042349339. 325 0.465913057327
26000 of 32509 19.7509899139. 325 0.533140897751
29250 of 32509 25.5253880024. 325 0.59960603714
32500 of 32509 31.9292008877. 325 0.660928010941

Redland Hash

In [13]:
def testRedlandHash():
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    try:
        options = "contexts='yes',hash-type='bdb',dir='{0}'".format(storage_dir)
        storage = RDF.HashStorage('redland-test', options=options)
        model = RDF.Model(storage)
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                            pyldToRedlandNode,
                            RDF.Statement,
                            model.add_statement)
    finally:
        print os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing

timing['redland-hash'] = testRedlandHash()
storage_dir: /tmp/test_model_LfZEmY
length: 32509
3250 of 32509 13.587677002. 325 2.58502912521
6500 of 32509 54.0975439548. 325 5.30358791351
9750 of 32509 122.25406003. 325 8.03873109818
13000 of 32509 217.757619858. 325 10.8126528263
16250 of 32509 341.260326862. 325 13.5416498184
19500 of 32509 492.162247896. 325 16.3071219921
22750 of 32509 672.341810942. 325 18.9963510036
26000 of 32509 881.394460917. 325 21.7362020016
29250 of 32509 1113.95988393. 325 24.4342980385
32500 of 32509 1374.27074885. 325 27.6178319454
['redland-test-sp2o.db', 'redland-test-po2s.db', 'redland-test-so2p.db', 'redland-test-contexts.db']

OK, that was painful. Let's try again, but without asking for the context graph.

In [14]:
def testRedlandHashContextFree():
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    try:
        options = "hash-type='bdb',dir='{0}'".format(storage_dir)
        storage = RDF.HashStorage('redland-test', options=options)
        model = RDF.Model(storage)
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                            pyldToRedlandNode,
                            RDF.Statement,
                            model.add_statement)
    finally:
        print os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing
                                        
timing['redland-hash-context-free'] = testRedlandHashContextFree()
storage_dir: /tmp/test_model_pcNxZT
length: 32509
3250 of 32509 0.149784088135. 325 0.0133171081543
6500 of 32509 0.282102108002. 325 0.012787103653
9750 of 32509 0.423465013504. 325 0.0127878189087
13000 of 32509 0.555802106857. 325 0.0126249790192
16250 of 32509 0.686717033386. 325 0.0130820274353
19500 of 32509 0.817808151245. 325 0.0128979682922
22750 of 32509 0.949244022369. 325 0.0128998756409
26000 of 32509 1.08057498932. 325 0.0130758285522
29250 of 32509 1.21215701103. 325 0.0130169391632
32500 of 32509 1.34427714348. 325 0.0131750106812
['redland-test-sp2o.db', 'redland-test-po2s.db', 'redland-test-so2p.db']

Wait a second, that's quite reasonable performance!

Top

Redland Sqlite

I tried talking with the developer of Redland in their IRC channel; after saying MySQL was the most tested backend, he suggested trying sqlite as well. It doesn't seem to have the $O(N^2)$ slow-down problem the bdb hash storage was having, but it's still slow.

In [15]:
def testRedlandSqliteFullContext():
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    try:
        options = "new='yes'"
        storage = RDF.Storage(storage_name='sqlite', 
                              name=os.path.join(storage_dir, 'redland-test'), 
                              options_string=options)
        context = RDF.Node(RDF.Uri('http://submit.encodedcc.org/experiments/'))
        model = RDF.Model(storage)
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'][:3000],
                            pyldToRedlandNode,
                            RDF.Statement,
                            model.add_statement)
    finally:
        print os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing
                          
timing['redland-sqlite-full-context'] = testRedlandSqliteFullContext()
storage_dir: /tmp/test_model_c4ihBd
length: 3000
300 of 3000 32.3209309578. 30 3.82048392296
600 of 3000 67.3903839588. 30 3.5882358551
900 of 3000 102.33176899. 30 3.22705793381
1200 of 3000 135.984196901. 30 3.3456029892
1500 of 3000 168.252134085. 30 3.14397311211
1800 of 3000 200.801614046. 30 3.46836805344
2100 of 3000 232.869807005. 30 3.21807599068
2400 of 3000 264.990459919. 30 3.05125379562
2700 of 3000 297.370455027. 30 3.27499008179
['redland-test']

Top

RDFLib

RDFLib API

In [16]:
from rdflib import ConjunctiveGraph
from rdflib import BNode, URIRef, Literal
In [17]:
def pyldToRDFLibNode(item):
    """Convert a JSON-serialized node to an RDFLib node
    """
    nodetype = item['type']
    value = item['value']
    datatype = item.get('datatype', None)

    if nodetype == 'blank node':
        return BNode()
    elif nodetype == 'IRI':
        return URIRef(value)
    else:
        # pass the datatype through instead of discarding it
        return Literal(value, datatype=datatype)

RDFLib timing test

In [18]:
def testRDFLibSleepycat():
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    try:
        graph = ConjunctiveGraph(store='Sleepycat')
        graph.open(storage_dir, create=True)
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                            pyldToRDFLibNode,
                            lambda x,y,z: (x,y,z),
                            graph.add)
    finally:
        print os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing


timing['rdflib-sleepycat'] = testRDFLibSleepycat()
storage_dir: /tmp/test_model_UeusSr
length: 32509
3250 of 32509 0.553234100342. 325 0.0490798950195
6500 of 32509 1.20439100266. 325 0.0664279460907
9750 of 32509 1.72844314575. 325 0.0503520965576
13000 of 32509 2.27863311768. 325 0.0666480064392
16250 of 32509 2.93121004105. 325 0.0548050403595
19500 of 32509 3.48354721069. 325 0.0513579845428
22750 of 32509 4.15174818039. 325 0.050968170166
26000 of 32509 4.68744206429. 325 0.0529808998108
29250 of 32509 5.2057390213. 325 0.0503079891205
32500 of 32509 5.72421002388. 325 0.0493898391724
['__db.001', '__db.002', 'contexts', 'prefix', 'k2i', '__db.003', 'c^o^s^p^', 'c^p^o^s^', 'i2k', 'c^s^p^o^', 'namespace']

Five seconds; that certainly seems usable.

Top

Soprano

In [19]:
from PyKDE4.soprano import Soprano
from PyQt4.QtCore import QString, QUrl
In [20]:
def pyldToSopranoNode(item):
    """Convert JSON-serialized node to a Soprano Node
    """
    nodetype = item['type']
    value = item['value']
    datatype = item.get('datatype', None)

    if nodetype == 'blank node':
        return Soprano.Node.createBlankNode(str(uuid.uuid1()))
    elif nodetype == 'IRI':
        value = QUrl(value)
        return Soprano.Node.createResourceNode(value)
    else:
        return Soprano.Node.createLiteralNode(Soprano.LiteralValue(value))
In [21]:
s = pyldToSopranoNode({'type': 'blank node', 'value': ''})
p = pyldToSopranoNode({'type': 'IRI', 'value': Soprano.Vocabulary.RDFS.label()})
o = pyldToSopranoNode({'type': 'literal', 'value': 7})

assert s.isBlank()
assert p.isResource()
assert o.isLiteral()
assert o.literal().toInt() == 7

Top

Soprano Timing Tests

Soprano Null

In [22]:
def testSopranoList():
    """This uses python functions for the timing loader.
    
    This gives a minimal baseline of how long the type conversion takes.
    """
    model = []
    timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                        pyldToSopranoNode,
                        lambda x,y,z: (x,y,z),
                        model.append)
    return timing
                                        
timing['soprano-list'] = testSopranoList()
length: 32509
3250 of 32509 0.104168891907. 325 0.00889587402344
6500 of 32509 0.186026096344. 325 0.00798916816711
9750 of 32509 0.268169879913. 325 0.00783586502075
13000 of 32509 0.34681892395. 325 0.00748300552368
16250 of 32509 0.482909917831. 325 0.00828099250793
19500 of 32509 0.571913957596. 325 0.00871801376343
22750 of 32509 0.704832077026. 325 0.00786399841309
26000 of 32509 0.783478975296. 325 0.00756597518921
29250 of 32509 0.863605976105. 325 0.00776886940002
32500 of 32509 0.952097892761. 325 0.00988984107971

Top

Soprano in-memory

In [23]:
def testSopranoDefault():
    """Default Soprano model.
    
    Its behavior suggests it's probably using Redland
    bsddb hash storage.
    """
    model = Soprano.createModel()
    assert model
    timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                        pyldToSopranoNode,
                        Soprano.Statement,
                        model.addStatement)
    return timing
                                        
timing['soprano-default'] = testSopranoDefault()        
length: 32509
3250 of 32509 9.95289301872. 325 1.74399209023
6500 of 32509 37.6105718613. 325 3.73698282242
9750 of 32509 87.5574810505. 325 6.22128415108
13000 of 32509 159.976824045. 325 8.16315102577
16250 of 32509 256.550194025. 325 10.9782390594
19500 of 32509 374.274512053. 325 12.6249821186
22750 of 32509 516.382483006. 325 15.3015880585
26000 of 32509 676.982827902. 325 16.7248778343
29250 of 32509 855.201326847. 325 18.8108589649
32500 of 32509 1053.39780402. 325 20.2996370792

Top

Soprano Hash

In [24]:
def testSopranoStorage():
    """Soprano with an explicit storage location.
    
    Again, probably a Redland hash storage model. Also so
    painfully slow that I trimmed the number of datapoints.
    """
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    storage_setting = Soprano.BackendSetting(Soprano.BackendOptionStorageDir, storage_dir)

    try:
        model = Soprano.createModel([storage_setting])
        assert model
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'][:3000],
                            pyldToSopranoNode,
                            Soprano.Statement,
                            model.addStatement)
        
    finally:
        print os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing
                                        
timing['soprano-hash'] = testSopranoStorage()        
storage_dir: /tmp/test_model_3PTwQg
length: 3000
300 of 3000 13.3298258781. 30 1.58448195457
600 of 3000 28.6537108421. 30 1.54429578781
900 of 3000 43.8913290501. 30 1.53432202339
1200 of 3000 59.5767500401. 30 1.51763105392
1500 of 3000 76.5881268978. 30 1.88469982147
1800 of 3000 92.7581138611. 30 1.43498492241
2100 of 3000 110.605937958. 30 1.7611579895
2400 of 3000 128.537503958. 30 1.76784992218
2700 of 3000 147.209956884. 30 1.95115303993
['soprano-contexts.db', 'soprano-so2p.db', 'soprano-po2s.db', 'soprano-sp2o.db']
In [25]:
def testSopranoAddStatementsHash():
    """Soprano does have a "addStatements" function. 
    
    Does that do anything to avoid building an index as data is loaded?
    """
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    storage_setting = Soprano.BackendSetting(Soprano.BackendOptionStorageDir, storage_dir)

    try:
        model = Soprano.createModel([storage_setting])
        assert model
        statements = []
        def commit(model=model, statements=statements):
            print 'wait for it'
            model.addStatements(statements)
            print 'done'
            
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                            pyldToSopranoNode,
                            Soprano.Statement,
                            statements.append,
                            commit)        
    finally:
        print 'dir contents', os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing
                                        
timing['soprano-hash-commit'] = testSopranoAddStatementsHash()
storage_dir: /tmp/test_model_w1ONme
length: 32509
3250 of 32509 0.111657857895. 325 0.00961089134216
6500 of 32509 0.195324897766. 325 0.00780081748962
9750 of 32509 0.278141021729. 325 0.00815916061401
13000 of 32509 0.358635902405. 325 0.00805902481079
16250 of 32509 0.439920902252. 325 0.00788497924805
19500 of 32509 0.573112010956. 325 0.00857996940613
22750 of 32509 0.661998987198. 325 0.00883913040161
26000 of 32509 0.749497890472. 325 0.00797486305237
29250 of 32509 0.834306001663. 325 0.00850892066956
32500 of 32509 0.918995857239. 325 0.00848293304443
wait for it
done
dir contents ['soprano-contexts.db', 'soprano-so2p.db', 'soprano-po2s.db', 'soprano-sp2o.db']

Top

Soprano Virtuoso

In [26]:
def testSopranoVirtuoso():
    """Virtuoso loader. 
    
    Note: it appears that if you specify a storage directory it will create
    database files in that directory. Convenient if you don't want to mess
    up your desktop database.
    """
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    storage_setting = Soprano.BackendSetting(Soprano.BackendOptionStorageDir, storage_dir)

    try:
        virtuoso = Soprano.discoverBackendByName("virtuosobackend")
        assert virtuoso
        model = virtuoso.createModel([storage_setting])
        assert model
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                            pyldToSopranoNode,
                            Soprano.Statement,
                            model.addStatement)
    finally:
        print os.listdir(storage_dir)
        shutil.rmtree(storage_dir)
    return timing
                
timing['soprano-virtuoso'] = testSopranoVirtuoso()
storage_dir: /tmp/test_model_ebZ59I
length: 32509
3250 of 32509 6.93688797951. 325 0.799103975296
6500 of 32509 15.1029601097. 325 0.874695062637
9750 of 32509 23.2184431553. 325 0.804484128952
13000 of 32509 31.4607131481. 325 0.79851102829
16250 of 32509 39.6621539593. 325 0.808573007584
19500 of 32509 47.9102070332. 325 0.807959079742
22750 of 32509 55.9945321083. 325 0.796631097794
26000 of 32509 64.0528371334. 325 0.79683303833
29250 of 32509 72.3874390125. 325 0.853322982788
32500 of 32509 81.5685091019. 325 0.824387073517
['soprano-virtuoso.pxa', 'soprano-virtuoso.trx', 'soprano-virtuoso-temp.db', 'soprano-virtuoso.log', 'soprano-virtuoso.db', 'soprano-virtuoso.lck', 'soprano-virtuoso.lock']

Top

Relational SQLite3

After all those loads, I was curious what loading the data into a simple relational sqlite table would be like. To make it a bit fairer, I created an index on each of the columns.

In [27]:
import sqlite3
In [28]:
def testSqlite():
    storage_dir = tempfile.mkdtemp(prefix='test_model_')
    print 'storage_dir: {}'.format(storage_dir)
    try:
        conn = sqlite3.connect(os.path.join(storage_dir, 'triple.db'))
        cursor = conn.cursor()
        cursor.execute('create table triple ( subject text, predicate text, object text);')
        cursor.execute('create index subject_idx on triple (subject);')
        cursor.execute('create index predicate_idx on triple (predicate);')
        cursor.execute('create index object_idx on triple (object);')
        def statement(s, p, o):
            return s, p, o
        def addSqlite(stmt):
            cursor.execute('insert into triple(subject, predicate, object) values (?, ?, ?)', stmt)
        timing = loadModel(experiments['http://submit.encodedcc.org/experiments/'],
                            str,
                            statement,
                            addSqlite,
                            conn.commit)
    finally:
        print os.listdir(storage_dir)
        #shutil.rmtree(storage_dir)
    return timing

timing['sqlite'] = testSqlite()   
storage_dir: /tmp/test_model_kMWqkx
length: 32509
3250 of 32509 0.177853107452. 325 0.00862598419189
6500 of 32509 0.262326955795. 325 0.00856280326843
9750 of 32509 0.348209142685. 325 0.00775718688965
13000 of 32509 0.431903123856. 325 0.00786399841309
16250 of 32509 0.520929098129. 325 0.00949311256409
19500 of 32509 0.604700088501. 325 0.0078821182251
22750 of 32509 0.694653987885. 325 0.00804400444031
26000 of 32509 0.776463985443. 325 0.00800490379333
29250 of 32509 0.860121965408. 325 0.00809383392334
32500 of 32509 0.942430019379. 325 0.00813603401184
['triple.db']
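As a variant on the loader above, sqlite3's `executemany` can batch all the inserts into a single call; a minimal sketch using an in-memory database and made-up rows (not the experiments dataset):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table triple (subject text, predicate text, object text)')

# batch-insert 1000 synthetic triples in one call
rows = [('s{}'.format(i), 'p', 'o{}'.format(i)) for i in range(1000)]
cursor.executemany(
    'insert into triple(subject, predicate, object) values (?, ?, ?)', rows)
conn.commit()

cursor.execute('select count(*) from triple')
print(cursor.fetchone()[0])  # → 1000
```

This wouldn't drop into `loadModel` as-is, since that loader adds one statement at a time, but it's the same batch-then-commit idea the Soprano `addStatements` test tries below.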

Top

Plots

All Triple Stores

This is the full plot showing all the triple stores. The times are in seconds. The wall at the end is because in one case I wondered if using Soprano's addStatements function might get around the $O(N^2)$ index building. Apparently it didn't.

In [29]:
figure(figsize=(10, 8))
for k in timing:
    plot(timing[k][0], timing[k][1], label=k)
title('All Triple Stores')
xlabel('records loaded')
ylabel('time (s)')
legend(loc=0)
Out[29]:

Sub-100 second loads

In [30]:
figure(figsize=(10, 8))
for k in timing:
    if timing[k][1][-1] < 100:
        plot(timing[k][0], timing[k][1], label=k + ' ({:.2f} s)'.format(timing[k][1][-1]))
title('Fast Triple Stores')
xlabel('records loaded')
ylabel('time (s)')
legend(loc=0)
Out[30]:

Sub-10 second loads

In [31]:
figure(figsize=(10, 8))
for k in timing:
    if timing[k][1][-1] < 10:
        plot(timing[k][0], timing[k][1], label=k + ' ({:.2f} s)'.format(timing[k][1][-1]))
title('Fast Triple Stores')
xlabel('records loaded')
ylabel('time (s)')
legend(loc=0)
Out[31]:

Discussion

It appears there are a lot of utterly terrible triple stores.

UPDATE: Apparently adding the context graph is really expensive with my version of Redland.

Assuming my implementation was right, Redland memory and Soprano Virtuoso performed similarly on my collection of triples. However, the curves do suggest that Soprano Virtuoso will perform better on more realistic triple databases.

Also, RDFLib has done some pretty impressive magic to get their loading so fast.
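The timestamp arrays returned by loadModel also make it possible to check the growth order numerically instead of eyeballing the curves. A stdlib-only sketch on synthetic data (the helper `growth_exponent` is my own, not from any of the libraries above): it estimates $k$ in $t \approx c N^k$ with a least-squares fit on a log-log scale.

```python
import math

def growth_exponent(indices, times):
    """Estimate k in t = c * n**k via least squares on a log-log scale."""
    xs = [math.log(n) for n in indices]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # slope of the log-log regression line is the exponent k
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# synthetic loader with quadratic cost: t = 1e-6 * n**2
ns = list(range(100, 1100, 100))
ts = [1e-6 * n ** 2 for n in ns]
print(round(growth_exponent(ns, ts), 3))  # → 2.0
```

A slope near 1 means linear inserts; a slope near 2 is the bdb-hash-with-contexts behavior seen above.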

Addendum

Complexity plots

Just to remind myself, I wanted to see what the difference between $O(N)$, $O(N \log N)$ and $O(N^2)$ looked like.

In [34]:
plot(range(1, 100), range(1, 100), 'k', label='$O(N)$')
plot(range(1, 100), [x * log(x) for x in range(1, 100)], 'b', label='$O(N \log N)$')
plot(range(1, 100), [x * x for x in range(1, 100)], 'r', label='$O(N^2)$')
legend(loc=0)
Out[34]: