Skip to content
Print

Is a Separate Text Search Engine a Bad Idea?

11
Oct
2007

I was reading this blog entry a few days ago, and it started me thinking about full-text searching.  That wasn’t the main topic of the post, but I think the little side-trek into the field was interesting enough to merit some thought.  Right smack in the middle, Jamie goes on a bit of a rant about the pain of what is effectively two, separate databases (for example, MySQL and Lucene):

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDMBS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

There is some validity to this thought.  After all, duplication in software usually means you’re doing something wrong – or at least, there could be an easier way.  Even ignoring this precept, it’s just common sense that keeping data synchronized concurrently between two data sources as complex as a relational database and a full-text index is not an easy task.  Granted, some ORMs can handle this task for you (actually, I can only think of Hibernate and ActiveObjects having this feature), but the principle is the same.  And even if everything is neatly and auto-magically synced, there’s always a danger of something getting out of place, and then you’re stuck with a transient stale data issue that’s difficult to track down.

The author of the post mentions that he favors the full-text search capabilities of PostgreSQL, the popular open-source database and competitor to MySQL.  This does have the advantage that you’re putting all the data in one place, handling everything with a single query language (SQL), and reducing the technologies your software depends upon.  This inarguably makes things a whole lot easier.

The main problem as I see it is this is putting a ton of unnecessary strain on the database.  In most modern server-side applications, the bottleneck is in the database (usually caused by too much badly written SQL).  There are whole mountains of documentation which offers suggestions on how to alleviate this problem.  Indexes, database clustering and a carefully chosen ORM can go a long way.  Unfortunately, tacking on full-text indexing seems like a step in the wrong direction.

Lucene is very good at what it does.  It’s indexing and storage performance is second to none.  In fact, it’s so fast that a lot of companies use it as a quick-and-dirty storage dumping ground for raw data, knowing that it will be much faster and more scalable than a relational database.  Why not take advantage of this incredible power and take one more item off of your database’s back?  This is all not to mention the fact that a Lucene index query is probably a lot faster than an SQL query grabbing data from a PostgreSQL full-text index.

So what about the flip side of things?  Why not just put all the data into Lucene (or clone) and eschew relational databases altogether?  Well as I mentioned above, a lot of companies do this for simple things.  Lucene is fantastic at both scalability, and very fast indexing and querying of large blocks of text.  Where it begins to trip up is when you turn it loose on other data types.  Don’t get me wrong, Lucene is an amazing piece of technology.  But just like PostgreSQL isn’t a full-text search engine, Lucene isn’t an RDBMS.  Each component of the infrastructure needs to handle what it’s best at.  In fact, this is really a large aspect of scalability.  Ensuring that every technology is utilized to its fullest potential and no more is crucial to a high-volume application.

Final verdict?  I think I’m sticking with MySQL and Lucene working in tandem, each doing what they do best.  ActiveObjects makes the synchronization almost completely transparent, so it’s not like I’m loading myself down with unnecessary work from a code standpoint.  Seems like a good solution to me; and since most of the industry agrees, it’s probably a safe bet for you too.

Comments

  1. Using a lucene backend is great for a search engine, or even a site where the navigation involves browsing through mostly static content. It’s not strictly limited to searching, I know of a lot of shopping sites that are built on lucene.

    It’s pretty much useless for a site or section that needs transactional support, for instance you wouldn’t use it for a shopping cart page because it’s just not made for that type of thing. The index building functions are far less performant than you’d think, and you really require transactions for that type of data.

    I think that you have to design an application so that you can “plug in” a new search engine when MySQL fulltext starts to break down. For small applications MySQL will work perfectly (like this blog), but once you start to have a large volume of searching going on it simply can’t handle the load.

    That’s where it would be great to just “plug” a new search engine in that could utilize lucene.

    The How-To Geek Thursday, October 11, 2007 at 5:42 am
  2. use the best tool for the job. and sometimes this means more than one tool. unless the day comes when I can write an entire program in my RDMS, lol THEN I’ll be happy and singletonitized!

    owen Thursday, October 11, 2007 at 12:33 pm
  3. You are obviously using a relatively simple database + index configuration. One DB and one index ?

    Well, the world is bigger than that. When you have data *that changes frequently* spread across several hundreds of servers, some of it replicated in different clusters, keeping the Lucene index(es) *up to date and consistent* becomes a huge task. I know because I’m doing it right now and it sucks.

    Keeping the full text search engine outside of the database is the wrong way and market requirements will force RDBMS manufacturers to do a better job at integrating them while preserving performance and scalability.

    You're wrong Friday, October 12, 2007 at 1:19 am
  4. Well, I’ll admit my experience with that kind of system is limited, but I know of other people who do have the experience and they say that it’s often *easier* to keep the Lucene bits in sync than the DBMS.

    If you’re having difficulties keeping Lucene in sync with your database (or visa versa), you may want to take a hard look at how you’re doing things. Not just the underlying search-vs-database, but how you’re implementing them. As I mentioned, a few ORMs can keep search indexes and databases in sync automatically. Have you tried this approach? Even if you haven’t, a reasonable DAO should be able to take care of most of the pain by centralizing the Lucene update code.

    Lucene scales incredibly simply (since it’s usually just a bunch of files sitting on a hard drive). I honestly can’t see how a huge deployment would make it difficult to keep everything in sync. Maybe on the database side of life, but not with Lucene itself.

    Daniel Spiewak Friday, October 12, 2007 at 11:37 am
  5. There are cases where you would want to further filter or sort the result set from a full-text query by another parameter stored in a database. For instance, I search for articles stored in Lucene, but want to sort/rank them by user-given ratings, which are stored in a database. I don’t want to store the user ratings inside Lucene because that data changes a lot, and keeping the Lucene index updated is not easy. Multiple users cannot write to Lucene index concurrently. Another example: what if I want to sort the shopping cart items by price? So I think we need a single product which can handle both full-text and regular SQL queries and sort the results based on full-text relevance or some other db field. If only the database vendors take this seriously. I can see a huge demand for such an IR-DB product.

    Prathapan Sethu Tuesday, December 18, 2007 at 4:45 am
  6. That’s a good point. The problem I see is the same as I illustrated in the article: service overload. PostgreSQL does support full text indexing, but if you try to put both full-text indexing and databasing into a single point of failure, you’re going to have to deal with some serious scalability issues. I won’t deny that it’s possible to do things that way, but I don’t think it’s a terribly good idea for a high volume application.

    What would be really nice is some sort of integration between the search engine (Lucene) and the database (MySQL, PostgreSQL, it shouldn’t matter). This sort of integration could allow things like your use case, without overloading the database.

    Daniel Spiewak Tuesday, December 18, 2007 at 9:26 am

Post a Comment

Comments are automatically formatted. Markup are either stripped or will cause large blocks of text to be eaten, depending on the phase of the moon. Code snippets should be wrapped in <pre>...</pre> tags. Indentation within pre tags will be preserved, and most instances of "<" and ">" will work without a problem.

Please note that first-time commenters are moderated, so don't panic if your comment doesn't appear immediately.

*
*