This page tracks the progress on the 2007 Google Summer of Code project by Adam Pflug to develop a new search module for Gallery2. The current implementation of Gallery2's search functionality suffers from poor performance, a problem particularly pronounced for large datasets because indexes cannot be used to improve efficiency. The solution to this performance and scalability problem is to use DBMS agnostic full text indexing for searches. Furthermore, the current search implementation does not support complex queries (such as queries containing boolean operators) or effectively support even basic relevance-based ranking algorithms. The use of a full text index and an enhanced query parser could offer improvements in both these areas.
Completed items in bold.
The new search system is considerably more complex, and relying on other modules to index and search their own data (as is required with the current search API) is no longer realistic. Instead, the Advanced Search module takes over much of the responsibility for indexing and searching tasks. The new Advanced Search API will focus not on having other modules provide actual search results, but instead keeping the Advanced Search module informed about changes to their data.
Modules that implement this new API will have a few key features. They will implement the new search interface, which will include a method for retrieving what fields should be indexed and a method for retrieving all entities in the database that should be indexed for that module (for use in reindexing). In addition, this interface will include a method for formating a search result that was returned because of matching data in this module. Finally, modules will be responsible for sending events whenever new data is created, updated, or deleted, so that the index may be updated accordingly. Other than these basic requirements, modules will not be
The simplest way to increase the performance of full text searches in gallery would be to take advantage of MySQL’s built in full text indexing support. This would result in much faster search performance, especially for large datasets, at the expense of slightly slower insertion times for new photographs. Unfortunately, full text indexes are not directly supported by many other database management systems. This means such an implementation would tie the search functionality to MySQL, which is unacceptable in light of Gallery2’s DBMS agnostic stance. Further complicating matters is the fact only the MyISAM storage engine for MySQL supports full text indexes. This is significant because it forces a choice between the full text indexes of MyISAM and some of the more advanced transaction and referential integrity support available with other storage engines, such as innoDB, which may be desirable in the future.
A more realistic solution would be to build and maintain an inverted index of the full text to be search. This would allow for the same performance gains as the MyISAM dependant solution, but with an implementation that is friendly to other database systems. Support for word stemming could be included to help limit the number of terms in the index while simultaneously increasing recall, an important function in small corpora. Maintaining a full text index of this type also would allow for more powerful query syntaxes, including support for boolean statements and qualifiers. For example the search “intitle: travel OR vacation” might match all instances where the title included either the word “travel” or the word “vacation”. Furthermore, this implementation would allow for more advanced methods of ranking search results based on query relevance.
Such implementations of full text indexes are already in use in other open source projects. The Zend Technologies has ported the Apache Foundation’s Lucene search project to PHP, and phpBB has long used full text indexes for their search implementation. These projects and others could serve as guides for the development of improved search functionality for Gallery2.
The basic search system would be made up of several components: the index (in the database), the indexer, the query parser, the search engine itself, and a user interface. The design of an interface to the search functionality would be relatively trivial, so I’m going to focus mostly on the other components.
The index would be responsible for storing information about which items contain what words in a format that is efficient for information retrieval. It would be stored in a number of database tables optimized to maximize performance. The two most important parts of the index would be a dictionary, which would store all the terms used in corpus, and a hit list, which would contain links between the dictionary and the items those terms appear in. The hit list would also contain information about the location and number of occurrences of a term in each item. For example one hit in the hit list might specify that the word “travel” was appeared once in the filename and twice in the description of the photo with the id 5. There are a number of different options for the exact schema for these tables that would have different performance tradeoffs that could be evaluated over the course of the project.
The indexer would be hooked into existing gallery code that added, deleted, or updated photos and would responsible both for creating the index for existing items (in cases where the index must built or rebuilt from scratch – for example during an upgrade of gallery or installation of the search module) and for ensuring that index stays up to date as the gallery changes. In order to add an item into the index, there are a number of tasks that the indexer must complete. First, if a field being indexed contained any kind of special semantics (for example HTML), that information would need to be stripped out. Next the indexer would remove punctuation and other symbols from the fields to be indexed and break them up into a list of words. After that the words in this list would be transformed to use a standard case (e.g. all lowercase letters) to make string comparisons more efficient. The list would then be filtered to remove some common words (called stop words) such as articles (the, a, an) that do not reflect what a field is about which would take a significant amount of space in the database to store. At this point a stemming algorithm could optionally be run on the list of words to group different forms of the same term (jump, jumped, jumping) together. After that any new terms in the item being indexed could be added to the dictionary, and finally this the list of terms would be transformed into a hit list that could be stored in the index.
The query parser would perform many of the same tasks as the indexer to translate the user’s search query into an internal representation that could be manipulated more easily and be used to query the database. Like the indexer it would take care of removing invalid symbols, splitting the string into words, standardizing the formatting of those words, removing stop words, and possibly stemming the search terms (if stemming was used during indexing). Additionally however, the query parser would be responsible for parsing any special commands from query string that represented instructions about how the search ought to be conducted. These commands might include things like Boolean operators or field restrictions.
After the query is parsed it would be handed to the search engine which would transform the query into a set of SQL commands to search the index for matching items. The engine would then retrieve the relevant items from the database for further analysis and/or display. Optionally, the engine might also apply a ranking algorithm to the result set before passing it out to the client code. This algorithm could vary greatly and some testing would be required to identify the optimal balance of many considerations. That said, a short list of some of the factors that might be used to rank results by relevance includes: the number of times each term in the query occurs in the corpus, the number of times any term occurs in the corpus, the number of times each term in the query occurs in each item, the number of terms in each item, the date an item was taken and/or posted, the location of terms within the item, etc.
Regarding localization of the search functionality, there are two pieces of this feature that are directly affected by the language in use: the stemming algorithms and the list of stop words. Stemming algorithms are inherently language dependant, and are fairly complex to implement. One of the most commonly used stemming algorithms for the English language is the Porter stemming algorithm. Open source implementations of this algorithm natively in php are freely available, which makes inclusion of this functionality for the English galleries fairly trivial. Unfortunately however while open source algorithms exist for many (but obviously not all) other languages, they are typically implemented in C or Java. One of the most popular collections of these algorithms, the Snowball project is implemented in C but also exists as a PHP extension. Unfortunately this solution may be undesirable because of the increased system requirements and portability concerns.
This results in an interesting dilemma, not all languages have stemming algorithms, and for many of those that do installation could be complicated. One option would be to use stemming only for English galleries. Perhaps a better way would be to provide a standard stemmer interface so that modules for stemming various languages could be installed separately, though I’m not quite sure how the exact implementation details of this would work within Gallery’s module framework. Regardless, this would effectively make stemming an optional feature that could be installed or uninstalled by advanced users, and that could be developed and maintained separately from the core search functionality. Unfortunately no matter what the implementation of the actual algorithms, stemming could not be used in galleries in which multiple languages are in use simultaneously unless a different index was maintained for each language. Even in that case however, the index to search in would have to be specified by the user along with their search query.
The other potential localization problem lies with the usage of stop words. These words would be different for each language but, unlike the rules for stemming algorithms, the generation of these lists (especially when based on a list in another language) would not be particularly difficult for a fluent speaker. Likewise, there is probably not a significant problem with collisions between multiple stop word filters, because stop words in one language are not likely to be terribly missed in for searches in another language. Finally, the worst case scenario for localization efforts of stop words would be the simple omission of a stop word list for a language. The result of this would be that those words would be indexed; creating a larger index and possibly reducing search precision and performance slightly. This is certainly not a desirable situation, but the consequences are far from crippling to the search tool’s overall usability.
I am myself a user of Gallery2, and as such I have a stake in the continued development of the software. I believe that I am extremely well qualified to give back to the community with this project because of both my academic background and professional experience. My major, Informatics, at the University of Washington focuses on using technology to facilitate access to information. As such, my coursework includes many classes that are relevant to this project, including classes on advanced database design, search engine design, and the design of information systems. I have augmented this theoretical groundwork with practical experience developing search engines in both PHP and Java/JSP, including the implementation of full text indexes. I have years of experience programming in a variety of languages and on a number of different platforms, and have worked with multiple database management systems including MySQL, PostgreSQL, and MSQL. I am also familiar with object oriented development in both PHP 4 and PHP 5. As a result, I believe I am well qualified to write code that is compatible with the many different configurations in which Gallery2 is run.