Importing lots of data into Hippo Repository

Published on 27 Jul 2008

As I write this blog post, my laptop at Hippo is working hard. It’s importing metadata from a web service into our Slide/WebDAV-based Hippo Repository. At this moment the web service contains over 270,000 entries. As a first real performance test, I’m trying to import them all, both in preview and in live, so I should end up with more than 540,000 documents in the repository.

The web service serves the items in a paged Atom feed. Our import application uses ROME and the ROME OpenSearch module to parse the feeds. Each entry has a URI from which its metadata can be fetched as XML. This XML is copied to Hippo Repository using the API of the Hippo Repository Java Adapter.
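To give an idea of what walking a paged feed involves, here is a minimal sketch that only computes which pages to request. It assumes the feed reports the OpenSearch totalResults and itemsPerPage values and uses a 1-based startIndex (the OpenSearch default). The class name, the method name and the page size of 100 are made up for illustration; the real import application uses ROME to parse the feeds and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class FeedPaging {

    /**
     * Returns the startIndex value of every page needed to cover
     * totalResults entries. Assumes a 1-based startIndex, as in the
     * OpenSearch default; a feed configured differently needs another offset.
     */
    public static List<Long> pageStartIndexes(long totalResults, long itemsPerPage) {
        if (itemsPerPage <= 0) {
            throw new IllegalArgumentException("itemsPerPage must be positive");
        }
        List<Long> starts = new ArrayList<>();
        for (long start = 1; start <= totalResults; start += itemsPerPage) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical page size of 100 entries per feed page.
        List<Long> starts = pageStartIndexes(270_000, 100);
        System.out.println("feed requests needed: " + starts.size());
    }
}
```

With 270,000 entries and a page size of 100, that is 2,700 feed requests, on top of one request per entry to fetch its XML.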

I’m really curious how the repository will perform on my laptop when it contains more than 540,000 documents. My local repository uses a MySQL database for its storage. I haven’t changed the memory settings for the repository or the database yet. At this moment it has processed over 43,400 entries, using 0.9 GB of storage. Although most entries are only 2-3 kB each, some are up to 10 kB. Not all of that data is necessary for the website we’re building. An average decrease of only 1 kB per entry means we use 550-600 MB less storage after importing all entries. Processing the XML slows down the import, but it can save a lot of storage.
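The storage numbers above are easy to check with a bit of arithmetic. The sketch below is just that: it assumes decimal units (1 MB = 1,000 kB) and a purely linear extrapolation from the 0.9 GB used after 43,400 entries, which ignores index growth and database overhead. The class and method names are made up for illustration.

```java
public class StorageEstimate {

    /** Storage saved in MB when every document shrinks by savedKbPerDoc (1 MB = 1,000 kB). */
    public static double savedMegabytes(long documents, double savedKbPerDoc) {
        return documents * savedKbPerDoc / 1000.0;
    }

    /** Linear extrapolation of total storage in GB from a partial import. */
    public static double projectedGigabytes(long entriesDone, double gbUsed, long entriesTotal) {
        return gbUsed / entriesDone * entriesTotal;
    }

    public static void main(String[] args) {
        // 540,000 documents, 1 kB less per document.
        System.out.println("saved MB: " + savedMegabytes(540_000, 1.0));
        // 0.9 GB after 43,400 entries, 270,000 entries in total.
        System.out.println("projected GB: " + projectedGigabytes(43_400, 0.9, 270_000));
    }
}
```

For exactly 540,000 documents the saving comes to 540 MB; the 550-600 MB figure applies because the real count is somewhat higher. The extrapolation lands at roughly 5.6 GB for the full import, if storage keeps growing at the same rate.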

Creating this many documents forces you to think harder about which parts of the content to extract into properties. One unused WebDAV property means 540,000 unnecessary, indexed records in the database. Fortunately there’s enough time after my initial import to further optimise the application.

Update: it seems I reached some limit while writing this post. I increased the memory for the repository from 128 to 256 MB and changed the indexing cron job from every 5 seconds to every 5 minutes, which speeds things up a lot when you’re adding a lot of content. Also, because of a bug in my code, too many documents ended up in the same folder. Slide doesn’t really like that, so I fixed the bug. Now it’s updating fast.
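One common way to keep folders small is to spread documents over subfolders derived from a hash of the document name. The sketch below shows that idea only; it is not necessarily how the import application lays out its folders, and the class name, the root path and the two-level hex layout are all assumptions made for illustration.

```java
public class FolderBuckets {

    /**
     * Builds a path such as root/3f/a1/documentName, where the two folder
     * levels come from the low 16 bits of the name's hash code.
     * Two levels of 256 folders give 65,536 leaf folders.
     */
    public static String bucketedPath(String root, String documentName) {
        int hash = documentName.hashCode();   // deterministic for a given String
        int first = (hash >>> 8) & 0xff;      // first folder level, 00..ff
        int second = hash & 0xff;             // second folder level, 00..ff
        return String.format("%s/%02x/%02x/%s", root, first, second, documentName);
    }

    public static void main(String[] args) {
        // Hypothetical root folder and document name.
        System.out.println(bucketedPath("/content/items", "entry-12345.xml"));
    }
}
```

With 540,000 documents spread evenly over 65,536 leaf folders, each folder would hold about 8 documents on average. A single level of 256 folders would still leave around 2,100 per folder.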