Lily is based on Apache HBase and SOLR. We spent a long time thinking about these choices, as they are fundamental to how Lily will shape up. Here's what we learned.
In selecting a storage technology for Lily, we were initially looking for:
automatic scaling to large data sets, rather than e.g. a manually sharded SQL database setup
fault-tolerance: replication, automatic handling of failing nodes
a flexible data model supporting sparse data
runs on commodity hardware
efficient random access to data
open source, ability to participate in the development and hence the direction of the project
some preference for a Java-based solution
Deepening our understanding of the different available NoSQL options, we learned we were also looking for:
consistency: no chance of having two conflicting versions of a row
atomic update of a single row, single-row transactions
bonus points for MapReduce integration
By finally choosing HBase, we also got:
a data model in which some column families keep all versions and others do not, which fits our CMS document model very well.
ordered tables with the ability to do range scans on them, which makes it possible to build scalable indexes on top of them. Such indexes can be used instead of Lucene when the index is structured, large, and must be immediately up to date. For example, we use this to keep an index of the links that exist between records.
HDFS, a convenient place to store large blobs
Apache license and community, a familiar environment for us.
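To illustrate how ordered tables with range scans can serve as an index of links between records, here is a minimal sketch in Python. It does not use the HBase API: `OrderedTable`, `add_link`, `links_to` and the key layout (target id, a zero byte, source id) are hypothetical names and choices made for this example only, and record ids are assumed not to contain the zero byte.

```python
import bisect

class OrderedTable:
    """Toy stand-in for an ordered table: row keys are kept sorted."""

    def __init__(self):
        self._keys = []

    def put(self, key):
        # Insert the key at its sorted position, ignoring duplicates.
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def scan(self, start, stop):
        """Range scan: return all keys k with start <= k < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]

SEP = "\x00"  # assumed never to occur inside a record id

def add_link(table, source, target):
    # Key on the target first, so "who links to X?" becomes a prefix scan.
    table.put(target + SEP + source)

def links_to(table, target):
    prefix = target + SEP
    stop = target + "\x01"  # sorts after every key that starts with prefix
    return [key[len(prefix):] for key in table.scan(prefix, stop)]

table = OrderedTable()
add_link(table, "doc1", "doc3")
add_link(table, "doc2", "doc3")
add_link(table, "doc1", "doc30")
print(links_to(table, "doc3"))
```

Because the lookup only touches the contiguous key range for one target, its cost depends on the number of matching links rather than on the size of the table, and a new link is visible to the next scan as soon as it is written.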
For search, choosing Lucene as the core technology was pretty much a given. In Daisy, our previous CMS, we used Lucene only for full-text search and performed structural searches on the SQL database. We merged the results from those two different search technologies on the fly, supporting mixed structural and full-text queries. However, this merging, combined with other high-level features of Daisy, was not designed to handle very large data sets. For Lily, we decided that a better approach would be to perform all searching using one technology, Lucene.
A downside to Lucene is that index updates become visible to searchers only after some delay, though work is ongoing to improve this. At its heart Lucene is a text-search library, but with its fielded documents and trie-range queries it handles more data-oriented queries quite well.
Lucene in itself is a library, not a standalone application, nor a scalable search solution. But all this can be built on top. The best known standalone search server on top of Lucene is SOLR, which we decided to use in Lily.
But before we made that choice, we considered a lot of the available options:
Katta. Katta provides a powerful scalable search model in which each node is responsible for searching a number of shards, and replicas of each shard are present on several nodes. This provides scaling for both index size and number of users, and gracefully handles node failures, since the shards that were on a failed node remain available on other nodes. However, Katta is only a search solution, not an indexing solution, and does not offer extra search features such as faceting.
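The idea of replicated shards can be sketched as follows. This is not Katta's actual placement algorithm: the round-robin placement and the function names `assign_replicas` and `plan_search` are assumptions made for illustration.

```python
def assign_replicas(shards, nodes, replication):
    """Place each shard on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds node count")
    placement = {}
    for i, shard in enumerate(shards):
        placement[shard] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

def plan_search(placement, live_nodes):
    """Pick one live node per shard; fail if a shard has no live replica."""
    plan = {}
    for shard, replicas in placement.items():
        candidates = [node for node in replicas if node in live_nodes]
        if not candidates:
            raise RuntimeError(f"shard {shard} is unavailable")
        plan[shard] = candidates[0]
    return plan

placement = assign_replicas(["s0", "s1", "s2", "s3"], ["n0", "n1", "n2"], 2)
# With node n1 down, every shard can still be searched on another node.
print(plan_search(placement, {"n0", "n2"}))
```

With a replication factor of 2, losing any single node leaves at least one replica of every shard online; losing both nodes that hold the same shard makes that shard unavailable.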
Hadoop contrib/index. This is a MapReduce solution for building Lucene indexes. The nice thing about it is that the MR framework manages spreading the index building work over multiple nodes, reschedules failed jobs, and so on. It can also be used to update existing indexes. The number of index shards is determined by the number of reduce tasks. Hadoop contrib/index is an ideal complement to Katta. The downside is that it is inherently batch-oriented, which rules out profiting from the ongoing Lucene near-real-time (NRT) work.
The tools from LinkedIn (blog). LinkedIn has made available some cool Lucene-related projects like Bobo, an optimized facet browser that does not rely on cached bitsets, and Zoie, a real-time index-search solution (built in a different way than what is available in Lucene 3). They are apparently integrating it all in Sensei. It is interesting to study the design of these projects.
ElasticSearch. ElasticSearch (ES) is a very new project, which appeared as a one-man project at about the same time we chose SOLR. One can easily launch a number of ES nodes, and they discover each other without configuration. Multiple indexes can be created using a simple REST API. When creating an index, you specify the number of shards and replicas you desire. It is designed to work on cloud computing solutions like EC2, where the local disk is only temporary storage. There is a lot more to tell, but you can read that on their website. Despite the name 'elastic', it does not support indexes growing dynamically in the same way as tables can grow in HBase: the number of shards is fixed when creating the index. However, if you find yourself in need of more shards, you can create a new index with more shards and re-index your content into that. The number of shards is not related to the number of nodes, so you can plan for growth by choosing e.g. 10 shards even if you have just one or two nodes to start with.
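A small sketch may help show why a fixed shard count still leaves room for growth. It assumes documents are routed to shards by hashing their id modulo the shard count; the CRC32 hash and the names `shard_for` and `allocate` are illustrative assumptions, not ElasticSearch's actual routing or allocation code.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Route a document to a shard by hashing its id (assumed scheme)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def allocate(num_shards, nodes):
    """Spread the fixed set of shards over whatever nodes exist."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}

NUM_SHARDS = 10  # chosen once, when the index is created

small = allocate(NUM_SHARDS, ["node-a", "node-b"])
large = allocate(NUM_SHARDS, ["node-a", "node-b", "node-c", "node-d", "node-e"])

doc = "record-42"
shard = shard_for(doc, NUM_SHARDS)
# The document stays in the same shard; only the shard's host may change.
print(shard, small[shard], large[shard])
```

Adding nodes only moves whole shards around, so no document has to be re-routed. Changing the shard count, on the other hand, changes the result of `shard_for` for many documents, which is why growing beyond the initial number of shards requires re-indexing into a new index.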
Lucandra, Lucehbase and Hbasene. These projects work by storing the inverted index on top of Cassandra (Lucandra) or HBase (Lucehbase and Hbasene). Using a database in this way is quite different from Lucene's segment-based approach. While it makes the storage of the inverted index scalable, it does not necessarily make all of Lucene's functionality scalable, such as sorting and faceting, which depend on the field caches and bitset-based filters. Moreover, for HBase, which we know best, the storage is not as scalable as it may seem, since terms are stored as rows and the postings lists (= the documents containing the term) as columns. Usually the number of terms in a corpus is relatively limited, while the number of documents can be huge, but columns in HBase do not scale in the same way as rows. We think the scaling (sharding, replication) needs to happen on the level of the Lucene instances themselves, rather than just the storage below them. Still, it is interesting to watch how these projects will evolve.
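The terms-as-rows layout described above can be pictured with a toy model in which a nested dictionary stands in for a table: the outer key is the row (the term) and the inner keys are the columns (the ids of documents containing it). The function names and the choice of term frequency as the cell value are assumptions for this sketch, not the schema of any of these projects.

```python
from collections import defaultdict

def build_term_rows(docs):
    """Map term -> {doc_id: term frequency}: one row per term,
    one column per document that contains the term."""
    rows = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            rows[term][doc_id] = rows[term].get(doc_id, 0) + 1
    return rows

def shape(rows):
    """Return (number of rows, number of columns in the widest row)."""
    widest = max((len(columns) for columns in rows.values()), default=0)
    return len(rows), widest

docs = {
    "d1": "lily uses hbase",
    "d2": "lily uses solr",
    "d3": "lily lily",
}
rows = build_term_rows(docs)
print(shape(rows))
```

The number of rows is bounded by the vocabulary, while the row for a common term gains one column for every document that contains it, so growth in the corpus mostly shows up as ever wider rows.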
Building our own. Another option was to just take Lucene itself and build our own scalable search solution using it. In this case we would have gone for a Katta/ElasticSearch-like approach to sharding and replication, with a focus on the search features we are most interested in (such as faceting). However, we decided that this would take too much of our time.