tag-based queries return inconsistent results

I use RediSearch to index hashes, each of which has a field "tags" containing a comma-separated list of tags. Certain queries with tags return mutually inconsistent results - here's an example. Sorry for the hard-to-read escaping of the tags :)
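
For context, each document is a plain hash that I then hand to RediSearch with FT.ADDHASH - roughly like this (a simplified sketch: the key, _t and tags values are lifted from one of the real documents further down, most fields are omitted, and 1.0 is just a placeholder score):

HSET p:212587 _t 1540247314 tags "w:couchride:t:dealer,w:couchride:t:new,t:image,w:couchride:t:iron,FL"
FT.ADDHASH il1 p:212587 1.0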

This search yields two results - here are the tags of the first one:

127.0.0.1:6379> FT.SEARCH il1 "@tags:{w\:couchride\:t\:altamotors}" LIMIT 0 1 SORTBY _t DESC RETURN 1 tags
1) (integer) 2
2) "p:222588"
3) 1) "tags"
   2) "w:couchride:t:dealer,w:couchride:t:new,t:image,w:couchride:t:altamotors,w:couchride:t:motocross,w:couchride:t:redshift,w:couchride,f:spaceportcyclesnew47,cf:d:132,cf:m:ariv2,FL"

Let's add one of those tags and re-try the query - we get 0 results:

127.0.0.1:6379> FT.SEARCH il1 "@tags:{w\:couchride\:t\:new} @tags:{w\:couchride\:t\:altamotors}" LIMIT 0 1 SORTBY _t DESC RETURN 1 tags
1) (integer) 0

Same for another tag:

FT.SEARCH il1 "@tags:{w\:couchride\:t\:dealer} @tags:{w\:couchride\:t\:altamotors}" LIMIT 0 1 SORTBY _t DESC RETURN 1 tags
1) (integer) 0

Other tag combinations can be consistent and return results.

Any pointers on further debugging would be appreciated.

Thanks,

Michael

Some more info on this: documents appear and disappear from these tag indexes continuously. When I first add a document via FT.ADDHASH it works great, immediately. But after a few minutes or hours, it becomes unreachable through the same queries. Here's another example that's even stranger:

Let’s try to locate some documents via numerical constraints:

127.0.0.1:6379> "FT.SEARCH" "il1" "@_t:[1540247314 1540247314]" "LIMIT" "0" "1" "SORTBY" "_t" "DESC" RETURN 1 tags
1) (integer) 3
2) "p:212587"
3) 1) "tags"
   2) "w:couchride:t:dealer,w:couchride:t:new,t:image,w:couchride:t:cruisers,w:couchride:t:cruiserfamily,w:couchride:t:harley-davidson,w:couchride:t:sportster,w:couchride:t:sportster1200,w:couchride:t:sportster1200iron,w:couchride:t:iron,w:couchride,f:adamecharley-davidson\xc2\xaenew,cf:d:59,cf:m:dsv7,FL"

Cool - now let's add a tag that's in the document above. Again, no results, just like my previous example:

127.0.0.1:6379> "FT.SEARCH" "il1" "@_t:[1540247314 1540247314] @tags:{w\:couchride\:t\:new}" "LIMIT" "0" "1" "SORTBY" "_t" "DESC" RETURN 1 tags
1) (integer) 0

But now let's add one more constraint - a free-text query string:

127.0.0.1:6379> "FT.SEARCH" "il1" "@_t:[1540247314 1540247314] XL 1200NS - Sportster iron @tags:{w\:couchride\:t\:new}" "LIMIT" "0" "1" "SORTBY" "_t" "DESC" RETURN 1 tags
1) (integer) 3
2) "p:212587"
3) 1) "tags"
   2) "w:couchride:t:dealer,w:couchride:t:new,t:image,w:couchride:t:cruisers,w:couchride:t:cruiserfamily,w:couchride:t:harley-davidson,w:couchride:t:sportster,w:couchride:t:sportster1200,w:couchride:t:sportster1200iron,w:couchride:t:iron,w:couchride,f:adamecharley-davidson\xc2\xaenew,cf:d:59,cf:m:dsv7,FL"

Ta-da!

I ran MONITOR while working on this and grepped for "p:212587" - these are all the operations that hit the server for that key:

1541898977.358744 [0 127.0.0.1:36488] "HGETALL" "p:212587"
1541899697.998675 [0 127.0.0.1:36564] "HGETALL" "p:212587"
1541900152.900435 [0 35.231.200.123:37024] "HGETALL" "p:212587"
1541900154.328735 [0 35.231.200.123:37024] "HGETALL" "p:212587:bl"
1541900418.142649 [0 127.0.0.1:36636] "HGETALL" "p:212587"

And here’s my FT.INFO just in case it’s useful:

127.0.0.1:6379> FT.INFO il1
 1) index_name
 2) il1
 3) index_options
 4) 1) "NOFREQS"
    2) "NOOFFSETS"
 5) fields
 6) 1) 1) ctitle
       2) type
       3) TEXT
       4) WEIGHT
       5) "1"
       6) NOSTEM
    2) 1) geo
       2) type
       3) GEO
    3) 1) _t
       2) type
       3) NUMERIC
       4) SORTABLE
    4) 1) tags
       2) type
       3) TAG
       4) SEPARATOR
       5) ,
    5) 1) price
       2) type
       3) NUMERIC
       4) SORTABLE
    6) 1) year
       2) type
       3) NUMERIC
       4) SORTABLE
    7) 1) mileage
       2) type
       3) NUMERIC
       4) SORTABLE
 7) num_docs
 8) "31667"
 9) max_doc_id
10) "3858017"
11) num_terms
12) "10505"
13) num_records
14) "1.8446744073683358e+19"
15) inverted_sz_mb
16) "17592186044297.982"
17) offset_vectors_sz_mb
18) "0"
19) doc_table_size_mb
20) "315.65262222290039"
21) sortable_values_size_mb
22) "336.76803588867188"
23) key_table_size_mb
24) "0.81483268737792969"
25) records_per_doc_avg
26) "582522628405265.75"
27) bytes_per_record_avg
28) "1"
29) offsets_per_term_avg
30) "0"
31) offset_bits_per_record_avg
32) "-nan"
33) gc_stats
34) 1) current_hz
    2) "68.944908142089844"
    3) bytes_collected
    4) "174798224"
    5) effectiv_cycles_rate
    6) "0.10952896031468519"
35) cursor_stats
36) 1) global_idle
    2) (integer) 0
    3) global_total
    4) (integer) 0
    5) index_capacity
    6) (integer) 128
    7) index_total
    8) (integer) 0
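
For reference, that corresponds to an index created roughly like this (reconstructed from the FT.INFO output above rather than the exact command I originally ran; WEIGHT 1 is simply the default):

FT.CREATE il1 NOOFFSETS NOFREQS SCHEMA ctitle TEXT NOSTEM geo GEO _t NUMERIC SORTABLE tags TAG SEPARATOR , price NUMERIC SORTABLE year NUMERIC SORTABLE mileage NUMERIC SORTABLE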

Thanks

Hey Michael,

Can you please try increasing the TIMEOUT parameter to something like 10000 and check again?
You can do this by adding "TIMEOUT 10000" right after the loadmodule option when you start Redis.
Let me know if it has any effect on the returned results.

It doesn't seem to make a difference. I also upped MAXDOCTABLESIZE to 10,000,000 just in case - here's my setup:

82069:M 11 Nov 2018 15:25:41.706 * RediSearch version 1.2.0 (Git=v1.2.0-179-gcc54f9b)

82069:M 11 Nov 2018 15:25:41.706 * concurrency: ON, gc: ON, prefix min length: 1, prefix max expansions: 200, query timeout (ms): 10000, timeout policy: return, cursor read size: 1000, cursor max idle (ms): 300000, max doctable size: 10000000, search pool size: 20, index pool size: 8,

82069:M 11 Nov 2018 15:25:41.707 * Initialized thread pool!

82069:M 11 Nov 2018 15:25:41.707 * Module 'ft' loaded from /Users/michaelmasouras/src/redis-5.0.0/redisearch.so
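
For reference, the loadmodule line now looks roughly like this (the same arguments work in redis.conf or on the redis-server command line):

loadmodule /Users/michaelmasouras/src/redis-5.0.0/redisearch.so TIMEOUT 10000 MAXDOCTABLESIZE 10000000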

What are the names of the index keys so that I can query them directly?

Some additional info:

  1. This has been happening sporadically for the last couple of months, but because I reprocess everything daily and was still in development mode, it wasn't completely obvious. In the last couple of weeks I've been preparing for a beta launch, so I audited the search results and discovered this chaos :(

  2. Before I used FT.SEARCH I kept my own indexes, and they illustrate the problem pretty well:

Great:

127.0.0.1:6379> ZCARD w:couchride:w:couchride:t:dealer:listings:live
(integer) 17526

127.0.0.1:6379> "FT.SEARCH" "il1" "@tags:{w\:couchride\:t\:dealer}" "LIMIT" "0" "1" "SORTBY" "_t" "DESC" RETURN 0
1) (integer) 17526

Great:

127.0.0.1:6379> ZCARD w:couchride:w:couchride:t:new:listings:live
(integer) 12187

127.0.0.1:6379> "FT.SEARCH" "il1" "@tags:{w\:couchride\:t\:new}" "LIMIT" "0" "1" "SORTBY" "_t" "DESC" RETURN 0
1) (integer) 12187

Oops!

127.0.0.1:6379> ZINTERSTORE dealernew 2 w:couchride:w:couchride:t:dealer:listings:live w:couchride:w:couchride:t:new:listings:live
(integer) 12187   // <-------- CORRECT (all new vehicles are offered by dealers, so the "dealer" set should be a superset of "new")

127.0.0.1:6379> "FT.SEARCH" "il1" "@tags:{w\:couchride\:t\:new} @tags:{w\:couchride\:t\:dealer}" "LIMIT" "0" "1" "SORTBY" "_t" "DESC" RETURN 0
1) (integer) 623   // <-------- INCORRECT
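
In case it helps with debugging, one way to narrow this down would be to sample document keys from my own (correct) intersection and probe them one at a time with INKEYS (assuming I'm reading the INKEYS option right) - a document that carries both tags but doesn't come back from such a query is one the tag index has lost:

127.0.0.1:6379> ZRANGE dealernew 0 4
127.0.0.1:6379> FT.SEARCH il1 "@tags:{w\:couchride\:t\:new} @tags:{w\:couchride\:t\:dealer}" INKEYS 1 p:212587 LIMIT 0 0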

Thanks

Michael

Hey Michael

After checking the RDB I found that you are correct and there is indeed a bug in RediSearch.
Please follow this issue for more details about the bug: https://github.com/RedisLabsModules/RediSearch/issues/534
The good news is that I have already submitted a fix and it's currently under review; please follow the PR so you will know when the fix is merged to master: https://github.com/RedisLabsModules/RediSearch/pull/535
The bad news is that there is no workaround - you must upgrade to get the fix. You can either wait for the 1.4.2 release (which is planned to ship soon) or take the fix directly from master. At least you do not have to re-index your data: just load the same RDB with the fixed version and it should work correctly.

Thanks for reporting the issue!

Meir

This is a huge relief, thank you for looking into it and fixing it so fast.

How soon do you think 1.4.2 will be out? I'd rather deploy a tested release everywhere than build from trunk.

Probably something like two weeks … maybe less.

I tried building from trunk. The fix resolves the issue on my Mac, but not on production Linux.

Both environments have:

  • the same RediSearch build from trunk

  • the same RDB file

  • identical FT.INFO output on both machines

Mac:

macbookpro:redis-5.0.0 $ uname -a
Darwin macbookpro.local 18.2.0 Darwin Kernel Version 18.2.0: Fri Oct 5 19:41:49 PDT 2018; root:xnu-4903.221.2~2/RELEASE_X86_64 x86_64

macbookpro:redis-5.0.0 $ redis-cli ft.search il1 "@tags:{w\:couchride\:t\:new} @tags:{w\:couchride\:t\:dealer} @geo:[-81.3792365 28.5383355 100 mi]" LIMIT 0 0
1) (integer) 7344

Linux (production Ubuntu) doesn't return the same count:

michaelmasouras@staging:/home/redis$ uname -a
Linux staging 4.15.0-1024-gcp #25-Ubuntu SMP Wed Oct 24 13:09:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

michaelmasouras@staging:/home/redis$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

127.0.0.1:6379> ft.search il1 "@tags:{w\:couchride\:t\:new} @tags:{w\:couchride\:t\:dealer} @geo:[-81.3792365 28.5383355 100 mi]" LIMIT 0 0
1) (integer) 47
(1.86s)

Please let me know if you don't see that discrepancy so I can debug further - I have been trying this for a few hours now and I consistently get these results across multiple production machines vs. my MacBook.

Michael

Turns out the issue I just reported is different from the one at the start of the thread. The original issue, where different tag combinations produce inconsistent results, seems to be fixed everywhere (Mac and Linux).

However, this geo bug still stands - it's just not a new one. I confirmed that I get the same results for this geo query with my original RediSearch build, so based on that evidence I don't think the fix introduced any new problems.

Another interesting thing to note: even if I remove all tag constraints, the geo query always returns 47 results on Linux - even for a radius larger than the Earth:

Mac (correct):

127.0.0.1:6379> ft.search il1 "@geo:[-81.3792365 28.5383355 100000 mi]" LIMIT 0 0
1) (integer) 30498

Linux:

127.0.0.1:6379> ft.search il1 "@geo:[-81.3792365 28.5383355 100000 mi]" LIMIT 0 0
1) (integer) 47

Note that there are definitely more than 47 docs with a valid geo field in that index:

$ redis-cli ft.search il1 "a*" LIMIT 0 10000 return 1 geo | grep "-" | wc -l
2860

Michael

It seems these queries are just timing out. I am gathering some timings and will get back to this thread.

Once I upped the timeout for these geo queries, I got consistent results back. After a little digging, I believe it's the underlying GEORADIUS Redis queries that are causing the slowdown:

127.0.0.1:6379> "GEORADIUS" "geo:il1/geo" "-80.191790" "25.761680" "10.000000" "mi" COUNT 1
1) "100002"

127.0.0.1:6379> "GEORADIUS" "geo:il1/geo" "-80.191790" "25.761680" "100.000000" "mi" COUNT 1
1) "100002"
(0.95s)

127.0.0.1:6379> "GEORADIUS" "geo:il1/geo" "-80.191790" "25.761680" "100.000000" "mi" COUNT 1
1) "100002"
(0.91s)

127.0.0.1:6379> "GEORADIUS" "geo:il1/geo" "-80.191790" "25.761680" "1000.000000" "mi" COUNT 1
1) "100002"
(3.32s)

A few questions:

  1. Is this cardinality expected for an index that has 40K documents?

127.0.0.1:6379> ZCARD geo:il1/geo
(integer) 6918748

  2. Is this latency what you'd expect for such a small index? Is there some denormalized index that explains why smaller-radius queries are faster than larger ones (otherwise you'd have to compute containment for every single document)?

  3. (minor improvement) Adding a tag constraint that matches 0 documents (the tag doesn't actually exist) doesn't seem to speed things up:

127.0.0.1:6379> "FT.SEARCH" "il1" "@tag:{doesnotexist} @geo:[-80.1917902 25.7616798 1000 mi]" "LIMIT" "0" "0" "SORTBY" "_t" "DESC"
1) (integer) 0
(3.21s)

Michael

RediSearch does not have a native geospatial index; instead, it offloads this to Redis using GEORADIUS.

It may be that the query engine is evaluating the GEORADIUS query before the tag query. It may be possible to optimize this in a future version and give precedence to cheaper queries over more expensive ones in a boolean query.

Mark Nunberg | Senior Software Engineer
Redis Labs - home of Redis

Email: mark@redislabs.com

*in an intersection query.

Mark

Thanks for the reply - do you think it's expected for the geospatial key to have 7M items for an index of 37K documents?

If you're updating/deleting documents, then older entries are not removed from the geo index. I don't believe there is anything we are inherently unable to handle (it's just a matter of checking whether each member of the set is a valid doc id), but this may well be the cause of the high cardinality.

Regards,
Mark

So you mean that as I continue to add and remove documents from my index, this geo index will grow indefinitely? Is that by design or by omission?

Is it safe for me to remove items from the geo index manually?

Michael

By omission :) - in the beginning we didn't really have any kind of garbage collection. Then we added GC for text indexes… then numeric indexes… and I guess soon, geo indexes.

It is safe to remove the items manually (that's not the "official" way to do this… but for now it should work); however, you need to know the numeric document ID of the document you've deleted. I believe Meir has implemented a debug command that provides this info.

Mark

Hi Meir, what is the debug command for getting the numerical docid?

FT.DEBUG DOCIDTOID
Note that you must run it before you delete the document, otherwise you will get an error message. Also note that this is an undocumented debug command, so it might be changed or removed in future releases.
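
So the manual cleanup would look roughly like this (treat it as a sketch - in particular, double-check the exact arguments DOCIDTOID expects):

FT.DEBUG DOCIDTOID il1 p:212587   // while the document still exists: get its internal numeric id (argument order assumed)
ZREM geo:il1/geo <numeric-id>     // after deleting the document: remove that id from the geo sorted set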