Saturday, November 17, 2007

chat about smart snapshots

the hackystat gang just finished up a long thread about large datasets in DPDs. i fought long and hard in that thread to no avail. but, i understand the competing idea and don't necessarily think it's bad. both solutions have their advantages and disadvantages.

through chatting i uncovered something else that i kinda don't like. it's a disadvantage in my mind, but not necessarily wrong. here it goes:

aaron: where is the smart snapshot?
aaron: the smart thing is going in reverse order?
austen: right. since snapshots are the latest data set
aaron: so you get thirty minutes chunks.
austen: ya
aaron: but what if the timestamps just so happen to span across the thirty minutes?
austen: then u would be done
austen: im not sure wut u are getting at
aaron: hm...
aaron: timestamp 1:59
aaron: timestamp 2:00
aaron: timestamp 2:01
aaron: all part of the same runtimestamp of 2:04
aaron: if you get timestamps from 2:00 - 2:30
aaron: then you are missing part of the data.
aaron: you have no idea whether the data actually spans more than 30 minutes.
aaron: the batch could be hours and hours long.
aaron: but the runtimestamp could be at 2:04
aaron: you'll never know when to stop.
aaron: so, i'm saying that algorithm is an approximation.
aaron: unless there was a way to say get all data with runtimestamp = 2:04
austen: wouldn't u just loop backwards till u find the end of the snapshot?
austen: within the specified time span
aaron: when do you know its done?
aaron: what is all the data?
aaron: theoretically you'll never know if you are done.
aaron: in practice i guess you could say if i don't get any data in the next bucket i look at then i guess i can call it done (approx).
aaron: still don't get it?
austen: u know its done when u are getting data whose run time is different.
aaron: not.. necessarily.
aaron: because batched data can be mixed right?
aaron: you could be sending data simultaneously
aaron: from the same user
aaron: intermixing data with different runtimestamps but very similar timestamps
austen: if its from the same tool invocation , the sensor will send runtimes that will be the same
austen: so no
aaron: what...
aaron: what if i had two windows open. run full-build at the same time.
aaron: the windows one is a better example.
austen: are u saying that it is broken cause the data will not be ordered?
aaron: the data will be ordered but based on timestamp
aaron: not runtimestamp
austen: yah. ok I see
aaron: basically we are doing approximation.
aaron: we are hoping that the 30 minute chunk has all the data.
aaron: and that it didn't actually span over 2 hours
aaron: because there is no way to actually know if it spanned over 2 hours
aaron: which is fine... i suppose.
aaron: that is what happens when you have smart on one side and not flexible on the other side.
aaron: we need smart and flexible to be optimized and exact
aaron: right now we are sort of optimized and approximated. which is fine.. but people need to understand that is the case.
aaron: and it will be harder to debug if the approximations aren't that great.
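
here is a rough python sketch of the thing i was getting at in the chat. it is not the real hackystat code. the names (`smart_snapshot`, `BUCKET`, the `(timestamp, runtimestamp)` tuples) and the "stop at the first bucket with no matching data" rule are all my own assumptions, just to show how walking backwards in thirty minute chunks can miss part of a batch:

```python
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=30)  # assumed chunk size, from the chat


def t(hour, minute):
    return datetime(2007, 11, 17, hour, minute)


# hypothetical batch: every entry is (timestamp, runtimestamp).
# all four entries belong to the run stamped 2:04, but one is much older.
RUN = t(2, 4)
DATA = [(t(0, 10), RUN), (t(1, 59), RUN), (t(2, 0), RUN), (t(2, 1), RUN)]


def smart_snapshot(data, end, max_buckets=48):
    """walk backwards from `end` in 30 minute buckets, collecting the latest run."""
    latest_run = None
    snapshot = []
    hi = end
    for _ in range(max_buckets):
        lo = hi - BUCKET
        bucket = [d for d in data if lo <= d[0] < hi]
        if latest_run is None and bucket:
            # first non-empty bucket decides which runtimestamp we are after
            latest_run = max(run for _, run in bucket)
        hits = [d for d in bucket if d[1] == latest_run]
        if latest_run is not None and not hits:
            break  # heuristic stop: nothing from our run here, call it done
        snapshot.extend(hits)
        hi = lo
    return snapshot


found = smart_snapshot(DATA, end=t(2, 30))
missed = [d for d in DATA if d not in found]
```

with this made-up data the walk picks up the 2:01, 2:00 and 1:59 entries, hits an empty bucket, and stops. the 0:10 entry has the same runtimestamp but never gets looked at. that is the approximation.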

to sum up, the current "smart" snapshot idea will definitely speed things up a lot. however, i think there are scary disadvantages. it is an approximation: theoretically you have no idea how big a snapshot is, so when can you actually stop? to be exact with this method you would have to search the entire day to be sure you have every single data point with that runtimestamp. but we probably won't do that. in fact, i think people will implement the stopping rule differently, so there will be different approximations. that might be bad, and approximations might make things harder to debug. on the other hand, it isn't wrong to do approximations.
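
to make the "different approximations" point concrete, here is another small python sketch. again, this is not hackystat code. `stop_on_other_run` is my guess at the "stop when the runtime is different" rule from the chat, `exact_snapshot` is the search-the-whole-day version, and the data is made up to look like two builds running at the same time:

```python
from datetime import datetime


def t(hour, minute):
    return datetime(2007, 11, 17, hour, minute)


# hypothetical data: (timestamp, runtimestamp) from two builds run at once,
# so the runs are interleaved when ordered by timestamp.
RUN_A, RUN_B = t(2, 4), t(2, 5)
DATA = [
    (t(1, 58), RUN_B),
    (t(1, 59), RUN_A),
    (t(2, 0), RUN_B),
    (t(2, 1), RUN_A),
    (t(2, 2), RUN_B),
]


def stop_on_other_run(data):
    """go newest to oldest, stop at the first entry from a different run."""
    ordered = sorted(data, key=lambda d: d[0], reverse=True)
    if not ordered:
        return []
    latest_run = ordered[0][1]
    snapshot = []
    for entry in ordered:
        if entry[1] != latest_run:
            break  # a different runtimestamp, assume the snapshot is over
        snapshot.append(entry)
    return snapshot


def exact_snapshot(data, day_start, day_end):
    """look at the whole day and keep everything with the latest runtimestamp."""
    day = [d for d in data if day_start <= d[0] < day_end]
    if not day:
        return []
    latest_run = max(run for _, run in day)
    return [d for d in day if d[1] == latest_run]


approx = stop_on_other_run(DATA)
exact = exact_snapshot(DATA, t(0, 0), t(23, 59))
```

on this data the early-stop rule keeps only the 2:02 entry, while the whole-day scan keeps all three entries from the 2:05 run. same data, two rules, two different "snapshots".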

anyway... that is all i have to say about that. i'm hoping that the performance issues will be solved pretty quickly with the methodologies that we are adopting.
