I want to combine Gears, Hashes, and Sets. Hashes will store profiles, Sets will be my indexes (each storing a subset of the Hash keys), and Gears will mesh them together, with the goal of producing N-dimensional contingency tables. As an example: “given that field value_1234 has possible values {1,2,3} and value_9876 has possible values {1,2}, give me the six counts (the count of rows where value_1234=1 && value_9876=1, the count where value_1234=1 && value_9876=2, etc…) over all profiles that have both fields”.
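To make the goal concrete, here is a self-contained sketch (plain Python, no Redis) of the data model and the counts I'm after; the profile keys and field names are placeholders matching the example above:

```python
from collections import Counter

# Hash-like profiles: key -> field/value map (placeholder data)
profiles = {
    'profile_1': {'value_1234': '1', 'value_9876': '1'},
    'profile_2': {'value_1234': '1', 'value_9876': '2'},
    'profile_3': {'value_1234': '2', 'value_9876': '1'},
    'profile_4': {'value_1234': '3'},  # missing value_9876, so excluded
}

# Set-like indexes: field name -> set of profile keys that have the field
indexes = {
    'value_1234': {k for k, v in profiles.items() if 'value_1234' in v},
    'value_9876': {k for k, v in profiles.items() if 'value_9876' in v},
}

# SINTER equivalent: profiles having both fields
rows = indexes['value_1234'] & indexes['value_9876']

# The 2-D contingency table: one count per (value_1234, value_9876) pair
table = Counter(
    (profiles[k]['value_1234'], profiles[k]['value_9876']) for k in rows
)
# here each of ('1','1'), ('1','2'), ('2','1') occurs once
```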
I understand the theory of map/reduce and have used Hadoop and similar frameworks. In “original recipe” Map/Reduce Hadoop, the input would be loaded by map tasks, each receiving something like a file reference, or a file reference plus an offset and length. How does the RedisGears reader distribute the work of generating the input data?
To shift from a general question to something highly specific, I did a proof of concept. The code below a.) computed the correct results, b.) was faster with the “Set indexes” (i.e. faster with the PythonReader than with the KeysReader), and c.) showed a speedup proportional to the size of the “Set indexes” (the fewer items in the Set index, the faster it ran, I think linearly). My concern is that when I go from my POC (~100k rows) to my full dataset (~100M rows) I’ll hit a wall. For example, it seems to me that my Python function would be 1.) running on one shard (call it “A”), requiring remote Set values (which might hold up to 100M members) to be moved over the network to shard “A” (for example, if set value_1234 lives on “A” and value_9876 lives on “B”), and 2.) yielding all of its data on “A”, so distributing the work to the other N-1 shards would mean moving those values over the network… no?
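On the cross-shard point: whether value_1234 and value_9876 land on the same shard is decided by Redis Cluster's keyslot function (CRC16-XMODEM of the key, mod 16384, honoring {hash-tag} substrings). A minimal sketch, validated against the check value from the cluster spec; sharing a hash tag is one way to force two set keys into the same slot, though whether that fits my data layout is a separate question:

```python
def crc16(data: bytes) -> int:
    # CRC-16/XMODEM (poly 0x1021, init 0), as in the Redis Cluster spec
    crc = 0
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def keyslot(key: str) -> int:
    # Hash only the {tag} substring, if a non-empty one exists
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Check value from the spec's reference CRC implementation
assert crc16(b'123456789') == 0x31C3

# Without a shared tag, the two index sets may hash to different slots...
print(keyslot('value_1234'), keyslot('value_9876'))
# ...but a shared hash tag pins them to the same slot (and shard)
assert keyslot('{idx}value_1234') == keyslot('{idx}value_9876')
```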
def SetAsIndex():
    # find all profiles matching an ad hoc boolean query of fields,
    # here value_1234 AND value_9876
    res = execute('SINTER', 'value_1234', 'value_9876')
    for x in res:
        res2 = execute('HMGET', 'profile_' + x, 'value_1234', 'value_9876')
        # yield records shaped like KeysReader output: {'key': ..., 'value': {...}}
        yield {
            'key': 'profile_' + x,
            'value': {'value_1234': res2[0], 'value_9876': res2[1]},
        }

bg = GearsBuilder('PythonReader')
# uncomment for the full-scan version (KeysReader is the default reader)
#bg = GearsBuilder()
#bg.filter(lambda x: 'value_1234' in x['value'] and 'value_9876' in x['value'])
bg.groupby(lambda r: r['value']['value_1234'] + ',' + r['value']['value_9876'],
           lambda key, a, r: 1 + (a if a else 0))
#bg.run('profile_*')
bg.run(SetAsIndex)
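As a sanity check on the counting step, the groupby extractor and reducer can be exercised outside Gears by folding the same lambdas over mock records (record shape assumed to match what SetAsIndex yields):

```python
# Mock records shaped like the ones SetAsIndex yields
records = [
    {'key': 'profile_1', 'value': {'value_1234': '1', 'value_9876': '1'}},
    {'key': 'profile_2', 'value': {'value_1234': '1', 'value_9876': '1'}},
    {'key': 'profile_3', 'value': {'value_1234': '1', 'value_9876': '2'}},
]

extractor = lambda r: r['value']['value_1234'] + ',' + r['value']['value_9876']
reducer = lambda key, a, r: 1 + (a if a else 0)

# Fold the reducer per group key, accumulator starts as None
acc = {}
for r in records:
    k = extractor(r)
    acc[k] = reducer(k, acc.get(k), r)

print(acc)  # {'1,1': 2, '1,2': 1}
```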