Thursday, June 26, 2008

hackystat performance

i started recent discussion on hackystat-dev about some weirdness i was noticing when sending offline data to the hackystat sensorbase.

here is what i said,

  • I had a lot of offline data
  • I executed an ant sensor
  • the sensor started to send the offline data
  • i decided that i didn't want to wait for the offline data and did a Ctrl+C to kill the ant task.
  • the client seemed to recover fine.i didn't check the task manager or anything.
  • but, i had a remote desktop connection to the sensorbase and noticed that the server continued to received data for quite a while. i eventually stopped the server too and restarted it.


  • the discussion continued i got a somewhat strange response from philip that caught me a little off guard.

    > Is it the asynchronous nature of a REST post?
    No, it's the nature of TCP and socket-based communication:

    <http://www.ncsa.uiuc.edu/~vwelch/net_perf/tcp_windows.html>

    Google on "socket buffer size" for more related links.
    Again, this is only my _hypothesis_ as to why you were receiving data after you
    killed the sending process.


    austen and i started to talk about this at work. austen initially agreed with philip. but, i still scratched my head... hm.. doing a search on the "socket buffer size" showed me that

    Typical network latency from Sunnyvale to Reston is about 40ms, and Windows XP has a default TCP buffer size of 17,520 bytes. Therefore, Bob's maximum possible throughput is:

    17520 Bytes / .04 seconds = .44 MBytes/sec = 3.5 Mbits/second

    The default TCP buffer size for Mac OS X is 64K, so with Mac OS X he would have done a bit better, but still nowhere near the 100Mbps that should be possible.

    65936 Bytes / .04 seconds = 1.6 MBytes/sec = 13 Mbits/second



    so... the delay i was seeing couldn't possibly be from 64K buffers could it? i'm not sure. so i designed a little experiment. here is what i did.

  • i created a builds worth of offline data by giving my sensorshell.properties file a bogus password. NOTE a builds worth of data is 11.3 MB.

  • i shutdown the sensorbase (we are using austens postgres version)

  • deleted all the logs from the sensorbase and my client

  • brought up a build for a project and executed ant checkstyle (which calls the sensor

  • i watch the consoles and logs


  • here is some interesting results.

    here is my Checkstyle.log file (edited to save horizontal space):
    22:44:57 Hackystat SensorShell Version: 8.1.530
    22:44:57 SensorShell started at: Thu Jun 26 22:44:56 HST 2008
    22:44:57 SensorProperties
    sensorshell.autosend.maxbuffer : 250
    sensorshell.autosend.timeinterval : 1.0
    sensorshell.logging.level : INFO
    sensorshell.multishell.autosend.timeinterval : 0.05
    sensorshell.multishell.batchsize : 499
    sensorshell.multishell.enabled : false
    sensorshell.multishell.maxbuffer : 500
    sensorshell.multishell.numshells : 10
    sensorshell.offline.cache.enabled : true
    sensorshell.offline.recovery.enabled : true
    sensorshell.sensorbase.host : http://blah:9876/sensorbase
    sensorshell.sensorbase.user : kagawaa@hahaha.hahaha
    sensorshell.statechange.interval : 30
    sensorshell.timeout : 10
    sensorshell.timeout.ping : 2
    sensorshell.properties file location: C:\..\sensorshell.properties
    22:44:57 Type 'help' for a list of commands.
    22:44:59 Host: http://naraku:9876/sensorbase/ is available.
    22:44:59 User akagawa@referentia.com is authorized to login at this host.
    22:44:59 Maximum Java heap size (bytes): 66650112
    22:44:59 AutoSend time interval set to 60 seconds
    22:45:00 Pinged http://blah:9876/sensorbase/ in 188 ms. Result is: true
    22:45:00 Checking for offline data to recover.
    22:45:00 Invoking offline recovery on 48 files.
    22:45:59 Timer-based invocation of send().
    22:46:59 Timer-based invocation of send().
    22:47:59 Timer-based invocation of send().
    22:48:59 Timer-based invocation of send().
    22:49:55 #> quit
    22:49:55 #> send
    22:49:55 Pinged http://blah:9876/sensorbase/ in 0 ms. Result is: false
    22:49:55 Server not available. Storing commands offline.
    22:49:55 Stored 4 sensor data instances in:
    C:\...\sensorshell\offline\2008.06.26.22.49.55.426.xml
    22:49:55 Quitting SensorShell started at: Thu Jun 26 22:44:56 HST 2008
    22:49:55 Total sensor data instances sent: 0



    at the end of the sensor execution it says it can't send 4 sensor data instances! what happened to the server!? hm.. what is going on. for some reason that indicates to me that sending the offline data made the sensorbase unavailable. so, i looked in the Checkstyle-offline-recovery.log file

    22:45:00 Maximum Java heap size (bytes): 66650112
    22:45:00 AutoSend disabled.
    22:45:00 Invoking offline recovery on 48 files.
    22:45:00 Recovering offline data from: 2008.06.26.21.12.01.081.xml
    22:45:00 Found 251 instances.
    22:45:00 Invoking send(); buffer size > 250
    22:45:00 #> send
    22:45:01 Pinged http://blah:9876/sensorbase/ in 204 ms. Result is: true
    22:45:01 Attempting to send 251 sensor data instances.
    Available memory (bytes): 63071744
    22:45:12 Error sending data: org.hackystat....SensorBaseClientException:
    1001: Unable to complete the HTTP call due to a communication error
    with the remote server. Read timed out
    22:45:12 org.hackystat.sensorbase.client.SensorBaseClientException: 1001:
    Unable to complete the HTTP call due to a communication error with the
    remote server. Read timed out at org.hackystat.sensorbase.client.
    SensorBaseClient.putSensorDataBatch(SensorBaseClient.java:827)
    at org.hackystat.sensorshell.command.SensorDataCommand.
    send(SensorDataCommand.java:82)
    22:45:12 Exception during send(): org.hackystat.sensorshell.
    SensorShellException: Could not send data: error in SensorBaseClient
    22:45:12 About to send data
    22:45:12 Successfully sent: 0 instances.
    22:45:12 Did not send all instances.
    C:\...\sensorshell\offline\2008.06.26.21.12.01.081.xml not deleted.


    in the Checkstyle-offline-recovery.log file, i can't find one successful send. but, i know i sent data over. i have no idea but maybe the "Timer-based invocation of send()." was able to send data, but there is no log of that timber-base send. anyway, so there was a problem with the server. but, the logs on the server show nothing. i just get a whole bunch of these

    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:56 k@gm.com Checkstyle CodeIssue
    22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue


    so i have no idea why the server wasn't responding. in the end off all of that i only see 1,742 entries in the database (there might be a problem with the sensordatatype for coverage). so that's 10 minutes of work for a very small amount of data entries. i would guess there were tens of thousands of possible entries in the offline data.

    here is another thing. note that in my Checkstyle.log it says that my sensor process ended at

    last entry in from Checkstyle.log
    06/26 22:49:55 Total sensor data instances sent: 0

    last entry on sensorbase
    06/26 22:55 Put: 2008-06-26T21:11:55 k@gm.com Checkstyle CodeIssue


    thats 6 minutes after the sensor ended. so, getting back to those 64K buffer caches... that seems a little strange to me. if thats really what is happening there must be a lot of caches.

    anyway, there are two things happening.

    1) i can send data to the hackystat sensorbase server way way faster than hackystat can consume
    2) because of that backlog on the server its not letting me send more data

    that worries me. after all 11.3 MB of data is really tiny; after all its just one full build of our system with sensors turned on. here are some questions:

    1) how fast can hackystat consume data?
    2) what is happening such that data is being processed on the server 6 minutes after the client is finished sending?
    3) whats the current bottleneck?
    4) how many users can send data (and how much data) simultaneously without killing the server?

    hm.. there are a lot of moving parts to this so this is just the start of looking into this. but, all i know is that 11.3 MBs is really small. in my other project we move that amount of data in an handful of seconds not even close to minutes.

    ps. i had to delete my offline data. :(

    1 comment:

    Philip Johnson said...

    Some thoughts at:

    http://groups.google.com/group/hackystat-dev/msg/1986b8443b6995ed

    Cheers,
    Philip