Should bug search match target names?
We have a small quandary on the Launchpad development team at the moment. As bug 268508 discusses, when one searches for a bug on Launchpad we do a substring search on the names of bug targets.
For instance, searching in Ubuntu for ‘gcc’ will return all bugs on the packages ‘gcc’, ‘gcc-4.4’, ‘gcc-4.3’, ‘gcc-3.3’ and so forth. Likewise, searching for bugs in a project group does a similar substring search on each of the individual projects in the project group.
It turns out that doing this search is itself expensive. I asked on the Ubuntu devel list about turning it off. Doing so would close bug 268508 and also significantly improve search performance.
However, this is a possibly contentious change – there was one mail strongly in favour of the current behaviour – so I’d like to put the proposal to a wider community.
If you’ve got a strong opinion – that the current behaviour is good, or, as bug 268508 describes, that it’s poor behaviour and we would be better off without it – then I’d love to hear from you. Just leave a comment on this post, drop me an email – robert at canonical.com – or post to the launchpad-users mailing list.
Thanks,
Rob (LP technical architect)
Tags: bug, front-page
February 7th, 2011 at 1:27 am
It depends on your indexing technology. If you’re using Xapian, you could set up a field parser with a non-letter stemmer, or you could treat white space and dashes as word boundaries. You could even set up two fields, both indexing all content but with different rules.
Again, it depends on your indexer. Though if searching for gcc is expensive, then you’re probably not using Xapian 🙁
February 7th, 2011 at 1:38 am
We’re not using Xapian, but even if we were: the question is not /how do we make this fast/ – we have that sorted, with an effort estimate. The question is ‘delete (possibly temporarily) or fix’.
The data I have suggests that deleting it entirely is probably the most useful thing we could do.
February 7th, 2011 at 2:57 am
If this will stop even 50% of the timeouts currently experienced, it is worth it. We can all learn to add to our search terms, but as the search exists, you cannot limit the actual search pattern. Whether you search on one word or a 50-word phrase, you time out because the search pattern is so big. It seems a shame that we must use Google to find things in Launchpad because it will search on the terms, rather than expanding the search to include all the items you are trying to keep out of the search.
February 7th, 2011 at 3:47 am
The obvious question to me was why am I searching for bugs. If I have a potential bug for a given application (gcc for example), I am likely searching for things that may be related. In which case, limiting the search to exact text matches will increase the probability that I would report a duplicate bug, as I won’t necessarily search for all permutations of gcc.
I think that there is really quite a limited subset of substring matching behaviour that is desirable (e.g., libpy should find libpy-3.4, libpy3.4, libpy34, but not libpython)… but if all forms of substring matching are removed, it becomes very difficult to find these potentially related bugs. To me, that is an ability that has no replacement if substring matching is removed.
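To illustrate the limited matching described in this comment, here is a minimal sketch in Python. It assumes that a "related" package name is the search term itself, optionally followed by a dash and then a version made of digits and dots. The function names and the regular expression are hypothetical, chosen for illustration, and are not Launchpad code.

```python
import re


def version_variant_pattern(term):
    """Build a pattern for `term` plus an optional version suffix.

    Assumption: a version suffix is an optional "-" followed by a digit
    and then any mix of digits and dots (e.g. "-3.4", "3.4", "34").
    """
    return re.compile(r"^" + re.escape(term) + r"(?:-?\d[\d.]*)?$")


def matches_variant(term, name):
    """Return True if `name` is `term` or a versioned variant of it."""
    return version_variant_pattern(term).match(name) is not None


if __name__ == "__main__":
    for candidate in ["libpy", "libpy-3.4", "libpy3.4", "libpy34", "libpython"]:
        print(candidate, matches_variant("libpy", candidate))
```

Under these assumptions, ‘libpy’ matches the three versioned names from the comment but not ‘libpython’, because the suffix must start with a digit.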
February 7th, 2011 at 7:02 am
@Robert – Remove substring searching, it’s a symptom.
You need to make sure your index parser is properly configured. Of course if you’re just using SQL, NoSQL or flat text with regex… that might be why it’s so slow. And unless you’re using a billion records, there is no reason why an enterprise indexer/search like Xapian, Lucene etc. could be slow on anything but a rubber band computer.
Perhaps I’d need to understand what tech you’re using first.
February 13th, 2011 at 2:51 am
I’d say prefix searching, or whole match searching is useful. I don’t see that substring matching of packages is particularly useful. There are things like “python-foo” where you certainly might only know the name of it as “foo”. But I don’t think searching for “pie” matching “guppie” is particularly useful. Then again, I also don’t think searching for “python” should match 1000 “python-foo” packages.
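The three matching modes compared in this comment can be sketched side by side in Python. This is only an illustration using the package names mentioned above; the function names are hypothetical and say nothing about how Launchpad implements its search.

```python
def whole_match(term, name):
    """The term must equal the package name exactly."""
    return term == name


def prefix_match(term, name):
    """The package name must start with the term."""
    return name.startswith(term)


def substring_match(term, name):
    """The term may appear anywhere in the package name."""
    return term in name


if __name__ == "__main__":
    cases = [("pie", "guppie"), ("foo", "python-foo"), ("python", "python-foo")]
    for term, name in cases:
        print(
            term,
            name,
            whole_match(term, name),
            prefix_match(term, name),
            substring_match(term, name),
        )
```

Only substring matching finds ‘python-foo’ from ‘foo’, but it is also the only mode that matches ‘guppie’ from ‘pie’. Prefix matching avoids the ‘guppie’ case yet still expands ‘python’ to every ‘python-foo’ style package.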
February 14th, 2011 at 9:21 am
+1 for dropping the substring search, at least for now, and until it is a setting. The biggest advantage I can see for substring matches is a slightly higher chance of avoiding duplicates when the user is not looking in the right place. This, however, comes at a very high cost, and is also counter-intuitive and counter-productive for most of us who use Launchpad daily and know where to look.
I would take the performance improvement over the increased duplicates any day (I’m not even sure it can make such a difference about duplicates, but even if it does!).
Even going for prefix matching only, as suggested on the mailing-list, still seems counter-intuitive to me, and would need at least smarter safeguards, like being disabled for less than 4-5 letters or if the prefix is in the name of the current package (to avoid bug 268508 for instance).
And if any kind of implicit matching on package name is kept (or re-introduced later), I would ask one favor: please please make this a setting that can be turned off per-search or per-user, even when it won’t have any performance hit anymore.
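The safeguards suggested in this comment could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the threshold of 4 letters (the comment says 4-5), the `enabled` flag standing in for a per-search or per-user setting, and the function name itself. None of it is Launchpad code.

```python
MIN_PREFIX_LENGTH = 4  # assumed threshold; the comment suggests 4-5 letters


def expand_by_prefix(term, current_package, names, enabled=True):
    """Return extra package names to include via prefix matching.

    Returns an empty list when the expansion is switched off, when the
    term is too short, or when the term already appears in the name of
    the package being searched.
    """
    if not enabled:
        return []
    if len(term) < MIN_PREFIX_LENGTH:
        return []
    if term in current_package:
        return []
    return [n for n in names if n.startswith(term) and n != current_package]


if __name__ == "__main__":
    names = ["python", "python-foo", "bzr", "gcc", "gcc-4.4"]
    print(expand_by_prefix("pyth", "bzr", names))
    print(expand_by_prefix("gcc", "gcc", names))
    print(expand_by_prefix("python", "python-foo", names))
```

With these rules, searching for ‘gcc’ within the ‘gcc’ package is not expanded to ‘gcc-4.4’, while a longer term typed in an unrelated package still is.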