ApacheBench testing of XML Parsing
So, I've been working on a project called PeopleAggregator, and we've been talking about integrating with a lot of different platforms, among them Drupal. (For the record, this is completely unrelated to the MT stuff that went on today. I may write on that later, but really, everyone else has said what I would in a million different ways.) Anyway, we were talking about RAP and how it's too bulky and slow to work for what we need.

So, we got a guy on the team - Joel De Gan, who's working on the PeoplesDNS project for us, and he offered to write us a parser. This is going to be a replacement for RAP, for those of us who can't deal with the slowness of RAP.

Now, I don't know much about RAP. And I don't know much about PHP, or parsing XML, or really anything - I pick up the bits I need to know as I go along. So I'm just kind of standing on the sidelines, but today, I got a demo of what Joel's parser can do.

LiveJournal FOAF files are typically big. Mine is no exception - over 100 friends, random contact data, etc. All in all, a 40KB document about me. I want to parse this data, so I tried it with both RAP and Joel's parser.

To take network conditions out of the picture, I copied the file I wanted locally. To simulate the action of opening a file and reading it, I did keep it on the webserver, so I will admit there may be some kind of bias in that, but I used the exact same method to open the file in both cases (fopen), so I don't think it would cause any major difference. I also disabled all printed output.

Anyway, I used this file to check the parsers with ab (the Apache benchmarking utility, which fetches a page a bunch of times and tells you how long it took). Using a 50-request check, I got averages for the two parsers:

Joel De Gan's XML parser, which parses data into a multileveled array, as displayed at http://crschmidt.net/parse/parse.php (source available):

Requests per second: 11.25 [#/sec] (mean)
Time per request: 88.92 [ms] (mean)
Time per request: 88.92 [ms] (mean, across all concurrent requests)

(Full Stats)
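For anyone wondering what "parses data into a multileveled array" might look like in practice, here is a minimal sketch in Python. It is not Joel's parser (which is PHP and not reproduced here); the function names, the `#text` and `@attributes` keys, and the sample document are all made up for illustration, and the sample skips the XML namespaces a real FOAF file would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for a FOAF-like document.
SAMPLE = """<Person>
  <nick>crschmidt</nick>
  <knows><Person><nick>friend_one</nick></Person></knows>
  <knows><Person><nick>friend_two</nick></Person></knows>
</Person>"""


def element_to_dict(elem):
    """Recursively turn an element into nested dicts and lists."""
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    # Children with the same tag are collected into one list per tag.
    for child in elem:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


def parse_xml(xml_text):
    """Parse an XML string into a nested structure keyed by the root tag."""
    root = ET.fromstring(xml_text)
    return {root.tag: element_to_dict(root)}


tree = parse_xml(SAMPLE)
# Walk the nested structure to pull out every friend's nickname.
friends = [k["Person"][0]["nick"][0]["#text"] for k in tree["Person"]["knows"]]
print(friends)
```

The appeal of this shape is that once the document is parsed, getting at the data is plain array/dict indexing, with no model or query API to learn.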

RAP, which parses into RDF models (source, plus RAP; the parser itself isn't actually here):

Requests per second: 1.35 [#/sec] (mean)
Time per request: 739.82 [ms] (mean)
Time per request: 739.82 [ms] (mean, across all concurrent requests)

(Full Stats)

So, we've got a parser that to a guy like me seems simpler to use (advanced data structures are part of the limited experience I did get from LiveJournal), is lightweight (one file, as opposed to 256 in RAP), and is roughly eight times faster (88.92 ms per request versus 739.82 ms), which is close to an order of magnitude.
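The speed comparison can be checked against the ab numbers above with a few lines of arithmetic. This is just a sketch: the variable and function names are mine, and it assumes the requests ran one at a time (the two "Time per request" lines being identical is consistent with that), so requests per second is simply 1000 divided by the mean milliseconds per request.

```python
def requests_per_second(ms_per_request):
    """Convert mean time per request (ms) to requests per second."""
    return 1000.0 / ms_per_request


# Mean time per request as reported by ab in the runs above.
joel_ms = 88.92
rap_ms = 739.82

joel_rps = requests_per_second(joel_ms)
rap_rps = requests_per_second(rap_ms)

# How many times faster the lightweight parser was in this test.
speedup = rap_ms / joel_ms

print(f"Joel's parser: {joel_rps:.2f} req/s")   # 11.25
print(f"RAP:           {rap_rps:.2f} req/s")    # 1.35
print(f"Speedup:       {speedup:.1f}x")         # 8.3x
```

Both computed rates match what ab printed, and the ratio comes out at a bit over 8x.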

That, to me, sounds like a winner. Props to Joel for his great work. His next step is to implement OWL capabilities into RDF parsing, and that's going to kick even more ass. As Eric said at one point about this: "Be still my beating heart."

I have absolutely no idea what happened with MT, other than hearing vague rumors that sixapart may have gotten attacked. Do you have any links or anything?

... Attacked? Well, I suppose that could be what you would call every "blogger in the blogosphere" deciding that you've "jumped ship", "gone corporate", or in other ways fucked up your licensing.

http://secure.sixapart.com/ just about sums it up. Basically, if you have more than one user on your SixApart site, you're going to have to shell out cash: at regular account prices, even for three authors and a commercial license, $100 for the new version.

Needless to say, many users don't have that kind of cash - or if they do, don't want to spend it on a project like Movable Type.

Mena's article is at http://www.sixapart.com/corner/archives/2004/05/its_about_time.shtml . The trackbacks from the post tell the tale better than I need to.

The message Six Apart is sending is loud and clear: We don't want personal users. The new version isn't for you. Sorry.

ahh. For some reason I heard that their site had been hacked or something. I could well have misunderstood.

I bow to your supreme nerdiness and geekdom.

Though I only vaguely understand this, I do concede it sounds very cool, and I'm curious about the generalities of the old system of data parsing vs. his new model. It sounds very innovative, but I am curious about the hows. :)

Since http://www.heterosapiens.com/~crschmidt/garfield.xml probably belongs to you (and thus so does dailygarfield), will you please fix it? I haven't been getting my Garfield for a week now, and withdrawal pains are here already :-(((

Not possible in the way I'm doing it now. They switched to a JavaScript form for writing the URL to the comic of the day - meaning that it's randomly generated, and that I can't pull it out of the page.

Sorry. You'll have to find another way to satisfy your Garfield cravings.

thanks. you might wish to update the userinfo on that feed, or something.

Oh!!!! It would seem that in their quest to break your feed, UComics switched to nice URLs like http://images.ucomics.com/comics/ga/2004/ga040511.gif (that's May 11's offering). This makes it all way simpler, doesn't it?
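If the URLs really do follow the pattern in that one example (four-digit year as a directory, then "ga" plus a two-digit year, month and day), the day's image URL could be built from the date alone. A rough sketch, where the function name is made up and the pattern is only inferred from the single May 11 URL above, so other dates may not resolve:

```python
from datetime import date


def garfield_url(day):
    """Build the comic image URL for a date, assuming the pattern
    comics/ga/<YYYY>/ga<YYMMDD>.gif seen in the May 11 example."""
    return f"http://images.ucomics.com/comics/ga/{day:%Y}/ga{day:%y%m%d}.gif"


# Reproduces the example URL from the comment above.
print(garfield_url(date(2004, 5, 11)))
```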
