Issue515

Title: Full text search doesn't return results
Priority: urgent
Status: chatting
Superseder:
Nosy List: ezio.melotti, introom, lemburg, r.david.murray
Assigned To:
Topics:

Created on 2013-04-08.14:03:36 by lemburg, last changed 2015-04-03.21:57:41 by introom.

Files
File name Uploaded Type
indexer.patch introom, 2015-04-03.21:57:41 text/plain
Messages
msg2724 Author: lemburg Date: 2013-04-08.14:03:35
When using the search box at the top of the tracker, or only the "All text" field in the advanced search, the tracker doesn't return any results for e.g. "2.7.4", even though there are tickets which have this string in their title.

Example URLs:

Advanced search for "2.7.4" in the title:
http://bugs.python.org/issue?%40search_text=&ignore=file%3Acontent&title=2.7.4&%40columns=title&id=&%40columns=id&stage=&creation=&creator=&activity=&%40columns=activity&%40sort=activity&actor=&nosy=&type=&components=&versions=&dependencies=&assignee=&keywords=&priority=&%40group=priority&status=1&%40columns=status&resolution=&nosy_count=&message_count=&%40pagesize=50&%40startwith=0&%40queryname=&%40old-queryname=&%40action=search

(returns 2 tickets)

Advanced search for "2.7.4" in "All text":
http://bugs.python.org/issue?%40search_text=2.7.4&ignore=file%3Acontent&title=&%40columns=title&id=&%40columns=id&stage=&creation=&creator=&activity=&%40columns=activity&%40sort=activity&actor=&nosy=&type=&components=&versions=&dependencies=&assignee=&keywords=&priority=&%40group=priority&status=1&%40columns=status&resolution=&nosy_count=&message_count=&%40pagesize=50&%40startwith=0&%40queryname=&%40old-queryname=&%40action=search

(returns 0 tickets)

Search for "2.7.4" in the search box at the top:
http://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=2.7.4&submit=search&status=-1%2C1%2C2%2C3

(returns 0 tickets)

From reading some other tickets related to search problems, this may be caused by the full text search index needing a rebuild.
msg2725 Author: r.david.murray Date: 2013-04-08.14:55:17
It looks like the function that does the splitting into words for indexing/searching is:

   re.findall(r'(?u)\b\w{2,25}\b', text)

which, if I pass it '2.7.4', doesn't return any matches. That is, as usual for text indexes, punctuation is ignored, and in addition single-character 'words' are ignored. So I don't think simply rebuilding the index will help.
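
For reference, a minimal sketch of that check. Only the regex is taken from the message above; the helper name split_words is made up for illustration.

```python
import re

# The word-splitting regex quoted above: runs of 2 to 25 word characters.
WORD_RE = re.compile(r'(?u)\b\w{2,25}\b')

def split_words(text):
    """Return the 'words' the regex finds in text (hypothetical helper)."""
    return WORD_RE.findall(text)

# '.' is not a word character, so "2.7.4" breaks into "2", "7" and "4",
# and each of those is too short to satisfy {2,25}.
print(split_words("2.7.4"))                 # []
print(split_words("Python 2.7.4 release"))  # ['Python', 'release']
```

Since no word is produced for "2.7.4" at indexing time either, the string never enters the index, which is consistent with a rebuild not helping.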
msg2958 Author: introom Date: 2015-04-02.20:10:51
Hi.

JUST FYI.

When a user submits an issue, the tracker splits both the title and the message into words. The splitting regex is the one quoted by RdM above.

The split words are stored in the table _words. Each _words entry has the form (a_split_word, a_text_id).

A _text_id is the primary key of the table _textids. The _textids entry tells us the associated issue title id or message id.

Thus, an inverted index is formed.

When a user searches, for example through the web interface, roundup.cgi.templating.py re-splits the query into words, looks them up in that inverted index, and finally returns the result.
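
The index and lookup described above can be sketched in memory as follows. This is an illustration under assumptions, not the tracker's code: the dicts words and textids stand in for the _words and _textids tables, the function names are hypothetical, and both the upper-casing of words and the "all terms must match" rule are assumptions.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r'(?u)\b\w{2,25}\b')

textids = {}              # stand-in for _textids: text_id -> (kind, item_id)
words = defaultdict(set)  # stand-in for _words: word -> set of text_ids

def index_text(text_id, kind, item_id, text):
    """Split text into words and record them against text_id."""
    textids[text_id] = (kind, item_id)
    for word in WORD_RE.findall(text):
        words[word.upper()].add(text_id)   # case folding is an assumption

def search(query):
    """Re-split the query and return entries containing every term."""
    terms = [w.upper() for w in WORD_RE.findall(query)]
    if not terms:
        return []   # nothing survived the split, e.g. for "2.7.4"
    hits = set.intersection(*(words.get(t, set()) for t in terms))
    return sorted(textids[t] for t in hits)

index_text(1, 'title', 'issue515', "Full text search doesn't return results")
index_text(2, 'msg', 'msg2724', "no results for 2.7.4")

print(search("results"))  # [('msg', 'msg2724'), ('title', 'issue515')]
print(search("2.7.4"))    # []
```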


My thinking is that the regex is not powerful enough. We could fix this by no longer treating "." as a word boundary.

But this may be (better) achieved by other means, like specifying the "version".
msg2961 Author: introom Date: 2015-04-03.21:57:41
This patch splits text as follows.

Suppose TEXT = "aa bb ee.ff.gg".

The split words will be: aa, bb, ee, ff, gg, ee.ff, ff.gg, ee.ff.gg

IOW, the new words are the neighbouring words joined by a dot.

When searching through the web interface, for example if the text is "kk hh.pp", the split words are "kk" and "hh.pp".
IOW, when searching, we take dot-separated words as a whole.


The handling for the csv interface and the Xapian-based indexer is not modified. If the above form is okay, I will do the remaining work.
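
A minimal sketch of the splitting described in this message, reproducing the two examples given. It is not the contents of indexer.patch: the function names and regex are hypothetical, and applying the existing 2 to 25 character length rule to every emitted word (joined or not) is an assumption.

```python
import re

# Assumption: a token is a run of word characters, optionally continued
# by ".word" groups, so "ee.ff.gg" is a single token.
DOTTED_RE = re.compile(r'(?u)\w+(?:\.\w+)*')

def _keep(word):
    # Assumption: keep the tracker's 2..25 character length rule.
    return 2 <= len(word) <= 25

def split_for_index(text):
    """Emit each dot-separated part plus every dot-joined run of neighbours."""
    out = []
    for token in DOTTED_RE.findall(text):
        parts = token.split('.')
        for n in range(1, len(parts) + 1):         # run length
            for i in range(len(parts) - n + 1):    # run start
                word = '.'.join(parts[i:i + n])
                if _keep(word):
                    out.append(word)
    return out

def split_for_search(text):
    """Take each dot-separated token as a whole."""
    return [t for t in DOTTED_RE.findall(text) if _keep(t)]

print(split_for_index("aa bb ee.ff.gg"))
# ['aa', 'bb', 'ee', 'ff', 'gg', 'ee.ff', 'ff.gg', 'ee.ff.gg']
print(split_for_search("kk hh.pp"))   # ['kk', 'hh.pp']
print(split_for_index("2.7.4"))       # ['2.7', '7.4', '2.7.4']
print(split_for_search("2.7.4"))      # ['2.7.4']
```

Under these assumptions "2.7.4" is stored at indexing time and looked up as a whole at search time, so the query from the original report would find a match.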
History
Date User Action Args
2015-04-03 21:57:41 introom set files: + indexer.patch, nosy: + ezio.melotti, messages: + msg2961
2015-04-02 20:10:51 introom set nosy: + introom, messages: + msg2958
2013-04-08 14:55:17 r.david.murray set status: unread -> chatting, nosy: + r.david.murray, messages: + msg2725
2013-04-08 14:03:36 lemburg create