James Aylett: Recent diary entries

  1. Tuesday, 17 Jun 2008: Widefinder: Final Results
  2. Thursday, 12 Jun 2008: Widefinder: Paying Attention
  3. Tuesday, 10 Jun 2008: Widefinder: Interesting Ways To Suck
  4. Monday, 9 Jun 2008: Thoughts on Widefinder
  5. Monday, 9 Jun 2008: Widefinder: Pretty Graphs
  6. Friday, 6 Jun 2008: Exporting emails from Outlook
  7. Wednesday, 4 Jun 2008: Claiming the evil namespace
  8. Monday, 26 May 2008: iPlayer problems
  9. Sunday, 18 May 2008: Google, the Fast Follower
  10. Monday, 7 Apr 2008: URI Posterity

Widefinder: Final Results

Published at
Tuesday 17th June, 2008
Tagged as
  • Scaling
  • Widefinder
  • Parallelism

By the end of my last chunk of work on WideFinder, I'd found a set of things to investigate to improve performance. All were helpful, one way or another.

With smaller data sets, the first three together gave around a 10% improvement running on my dev machine; the last gave about 1.5% on the WF-2 machine over the Python 2.5 build I'd made myself. However, I can't use it for the full run because it's 32 bit (the problem here is discussed below).

Initially I was running tests on my dev machine, using small amounts of data. There, out of order processing lost out slightly. It's bound to be a little slower in pure terms, because I had to use files instead of pipes for communication (I'm guessing the output was big enough to fill up the pipe; with some thought this could be fixed), but it's more efficient with lots of workers across a large amount of data, because the run time variance increases. Additionally, the naive way of decreasing concurrent memory usage meant increasing the number of reduces; the alternative would have increased the complexity of the user code, which I wanted to avoid. So there are probably further improvements that can be made.
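
As a minimal sketch (not the real driver code), this is the kind of pipe handover I mean, and why a large result can wedge it; the dictionary is just a stand-in for the real partial results.

import os, cPickle

# A pipe only buffers a few tens of kilobytes; if the parent waited for the
# child to exit before reading, a child with a big pickled result would block
# on the write and never exit. Draining the pipe before waiting avoids that.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.close(r)
    out = os.fdopen(w, 'wb')
    cPickle.dump({'hits': 12345}, out, -1)   # stand-in for the partial results
    out.close()
    os._exit(0)
os.close(w)
inp = os.fdopen(r, 'rb')
partial = cPickle.load(inp)
inp.close()
os.waitpid(pid, 0)
print partial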

Anyway: my final result. I re-tested my two final versions, one with out-of-order processing and one not, on the 10m data sets before letting them both rip on the full data set. On the 10m line set, I was seeing up to a 25% improvement over my previous best run; for the whole lot, I saw only an 18% improvement: 34m15.76s. At this kind of scale, out of order processing, even with the file serialisation overhead, gives around a 10% speed improvement. (This is somewhat less than Mauricio Fernandez found doing pretty much the same thing in OCaml.)

This isn't even as good as the other Python implementation; for some reason I'm still getting nothing like efficient use out of the processor. Having readers feed the data to the processors over pipes might improve things here. I also still had to use the 64 bit version of Python, as Python's mmap doesn't support the offset parameter (there's a patch to allow it, which is private on the Python SourceForge site, sigh). In fact, it feels like every architectural optimisation I've tried to make has made things slower because of Python; the only things that have really speeded it up (beyond the initial parallelism) are the lower level optimisations that make the code less idiomatically Pythonic anyway.
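
As an aside, here's a hedged sketch of what I'd like to be able to write: with the offset parameter that the patch adds (and which I believe is due in Python 2.6), each worker could map just its own chunk, keep every mapping comfortably under 2G, and so use the faster 32 bit interpreter. It's not working code on this setup.

import mmap, os

def map_chunk(path, chunk_start, chunk_end):
    f = open(path)
    size = os.fstat(f.fileno()).st_size
    # the offset has to be a multiple of the allocation granularity (the page
    # size here), so round down and map a little extra at the front
    offset = (chunk_start / mmap.PAGESIZE) * mmap.PAGESIZE
    length = min(chunk_end, size) - offset
    return mmap.mmap(f.fileno(), length, mmap.MAP_SHARED, mmap.PROT_READ,
                     offset=offset)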

It's possible that with some more thought and work, it would be possible to reduce this further; but I'd bet that Python will never get a run time below 15 minutes on WF-2 (although that would still put it in the same ballpark as the JVM-based implementations). At this point I'm mostly into optimising without being able to do the optimisations I want to, which seems fairly pointless; I've also sunk more time into this than I originally intended, so I'm going to stop there.

The code

User code (86 LOC)

I suspect that more code can be trimmed from here, particularly from the top() function, but it seems neater in many ways to leave it largely as the direct translation of Tim's original, with a few optimisations. Python is somewhat more verbose than Ruby; for this kind of job I find the readability about the same.

The only optimisation I'm unhappy with is the line that tests for a fixed string and then tests for a regular expression that starts with that fixed string; this does seem to work faster, but is crazily unreadable, and really needs a comment. Of course, regular expressions in general tend to cause readability problems on their own; it's notable that my immediate reaction on writing this was "hey, we could just get rid of the regular expression entirely and see what happens". What happens is that we start treating URIs with '.' in them, such as /ongoing/When/200x/2007/06/17/IMGP5702.png, as pages, when they aren't, so all the results are wrong.

You can't do this easily without regular expressions; Tim's URI layout means that /ongoing/When/200x/2007/06/17/ is an archive list page, rather than a genuine entry page, so you can't just treat anything without a '.' in it as a page (and that check is actually slightly slower than using a regular expression anyway). However, looking into this in detail brought up errors in what the regular expression matches as well: /ongoing/When/200x/2005/07/14/Atom-1.0 is a valid page URI, but the regex thinks it's an auxiliary file. There's also the unexpected /ongoing/When/200x/2005/04/18/Adobe-Macromedia?IWasEnlightenedBy=MossyBlog.com, which while not strictly within Tim's URI layout, is allowed to exist and be a page resource by his server configuration. Regular expressions, although powerful, are very difficult to craft correctly; this is made much harder by the problem being (essentially) an ad hoc one: I doubt Tim ever sat down and designed his URI layout with a thought to doing this kind of processing on it. (Even if he had, it would probably get caught out by something similar.)
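
To make those failure cases concrete, here's a quick throwaway check (separate from the program listing that follows) of the URIs above against the hit regex:

import re

hit_re = re.compile(r'^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$')
for uri in ['/ongoing/When/200x/2007/06/17/IMGP5702.png',  # auxiliary file: correctly rejected
            '/ongoing/When/200x/2007/06/17/',              # archive list page: correctly rejected
            '/ongoing/When/200x/2005/07/14/Atom-1.0']:     # real page: wrongly rejected (the '.')
    print uri, bool(hit_re.search(uri))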

import re, sys, parallel

def top(dct, num=10):
    keys = []
    last = None
    def sorter(k1,k2):
        if k2==None:
            return 1
        diff = cmp(dct[k1], dct[k2])
        if diff==0:
            return cmp(k2,k1)
        else:
            return diff
    for key in dct.keys():
        if sorter(key, last)>0:
            keys.append(key)
            keys.sort(sorter)
            if len(keys)>num:
                keys = keys[1:]
            last = keys[0]
    keys.reverse()
    return keys

hit_re = re.compile(r'^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$')
hit_re_search = hit_re.search
hit_str = "/ongoing/When/"
ref_str = '"http://www.tbray.org/ongoing/'

def report(label, hash, shrink = False):
    print "Top %s:" % label
    if shrink:
        fmt = " %9.1fM: %s"
    else:
        fmt = " %10d: %s"
    for key in top(hash):
        if len(key) > 60:
            pkey = key[0:60] + "..."
        else:
            pkey = key
        if shrink:
            print fmt % (hash[key] / 1024.0 / 1024.0, pkey)
        else:
            print fmt % (hash[key], pkey)
    print

def processor(lines, driver):
    u_hits = driver.get_accumulator()
    u_bytes = driver.get_accumulator()
    s404s = driver.get_accumulator()
    clients = driver.get_accumulator()
    refs = driver.get_accumulator()

    def record(client, u, bytes, ref):
        u_bytes[u] += bytes
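        # cheap fixed-string test first; only run the regular expression
        # (which starts with the same prefix) when the line can possibly match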
        if hit_str in u and hit_re_search(u):
            u_hits[u] += 1
            clients[client] += 1
            if ref !='"-"' and not ref.startswith(ref_str):
                refs[ref[1:-1]] += 1 # lose the quotes

    for line in lines:
        f = line.split()
        if len(f)<11 or f[5]!='"GET':
            continue
        client, u, status, bytes, ref = f[0], f[6], f[8], f[9], f[10]
        if status == '200':
            try:
                b = int(bytes)
            except:
                b = 0
            record(client, u, b, ref)
        elif status == '304':
            record(client, u, 0, ref)
        elif status == '404':
            s404s[u] += 1
    return [u_hits, u_bytes, s404s, clients, refs]

(u_hits, u_bytes, s404s, clients, refs) = parallel.process(sys.argv[1], processor)

print "%i resources, %i 404s, %i clients\n" % (len(u_hits), len(s404s), len(clients))

report('URIs by hit', u_hits)
report('URIs by bytes', u_bytes, True)
report('404s', s404s)
report('client addresses', clients)
report('referrers', refs)

Supporting library code (134 LOC)

Perhaps 30 LOC here is logging and debugging, or could be removed by getting rid of a layer or two of abstraction (I had lots of different driver types while working on this). This is actually my final running code: note that various things aren't needed any more (such as the chunksize parameter to ParallelDriver.process_chunk). This impedes readability a little, but hopefully it's still fairly obvious what's going on: we run J children, each of which subdivides its part of the data into a number of equal chunks (calculated based on the total memory we want to use, but calculated confusingly because I tried various different ways of doing things and evolved the code rather than, you know, thinking), and processes and reduces each one separately. The result per child gets pushed back over a pipe, and we run a final reduce per child in the parent process.

import os, mmap, string, cPickle, sys, collections, logging

logging.basicConfig(format="%(message)s")

class ParallelDriver:
    # chunksize only used to detect overflow back to start of previous chunk
    def process_chunk(self, processor, mm, worker_id, chunk_id, chunk_start, chunk_end, chunksize):
        start = chunk_start
        end = chunk_end
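        # back start up to the beginning of any line that straddles the chunk
        # boundary; LineIter below stops before a line that runs past end, so
        # each boundary line is handled by exactly one worker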
        if start>0:
            if mm[start]=='\n':
                start -= 1
            while mm[start]!='\n':
                start -= 1
            if mm[start]=='\n':
                start += 1
        # [start, end) ie don't include end, just like python slices
        mm.seek(start)
        class LineIter:
            def __init__(self, mm):
                self.mm = mm

            def __iter__(self):
                return self

            def next(self):
                c1 = self.mm.tell()
                l = self.mm.readline()
                c2 = self.mm.tell()
                if c2 > end or c1 == c2:
                    raise StopIteration
                return l

        it = LineIter(mm)
        result = processor(it, self)
        return result

    def get_accumulator(self):
        return collections.defaultdict(int)

class ParallelMmapFilesMaxMemDriver(ParallelDriver):
    def __init__(self, file):
        self.file = file

    def process(self, processor):
        # based on <http://www.cs.ucsd.edu/~sorourke/wf.pl>
        s = os.stat(self.file)
        f = open(self.file)
        j = os.environ.get('J', '8')
        j = int(j)
        maxmem = os.environ.get('MAXMEM', 24*1024*1024*1024)
        maxmem = int(maxmem)
        size = s.st_size
        if maxmem > size:
            maxmem = size
        chunksize = maxmem / j
        PAGE=16*1024
        if chunksize > PAGE:
            # round down to a whole number of pages
            chunksize = (chunksize / PAGE) * PAGE
        if chunksize < 1:
            chunksize = 1
        total_chunks = size / chunksize
        chunks_per_worker = float(total_chunks) / j
        commfiles = {}
        for i in range(0, j):
            commfile = os.tmpfile()
            pid = os.fork()
            if pid:
                commfiles[pid] = commfile
            else:
                pickle = cPickle.Pickler(commfile)
                result = None
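                # chunks_per_worker is a float, so these int() truncations
                # spread any remainder chunks across the workers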
                worker_start = int(i*chunks_per_worker) * chunksize
                worker_end = int((i+1)*chunks_per_worker) * chunksize
                if i==j-1:
                    worker_end = size
                chunks_for_this_worker = (worker_end - worker_start) / chunksize
                for chunk in range(0, chunks_for_this_worker):
                    chunk_start = worker_start + chunk*chunksize
                    if chunk_start >= size:
                        break
                    chunk_end = worker_start + (chunk + 1)*chunksize
                    if chunk==chunks_for_this_worker-1:
                        chunk_end = worker_end
                    mm = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, mmap.PROT_READ)
                    interim_result = self.process_chunk(processor, mm, i, chunk, chunk_start, chunk_end, chunksize)
                    mm.close()
                    if interim_result==None:
                        continue
                    if result==None:
                        result = interim_result
                    else:
                        for idx in range(0, len(interim_result)):
                            for key in interim_result[idx].keys():
                                result[idx][key] += interim_result[idx][key]
                pickle.dump(result)
                commfile.close()
                sys.exit(0)

        final = None
        for i in range(0, j):
            (pid, status) = os.wait()
            readf = commfiles[pid]
            readf.seek(0)
            unpickle = cPickle.Unpickler(readf)
            result = unpickle.load()
            if result==None:
                readf.close()
                continue
            if final==None:
                # first time through, set up final accumulators
                final = []
                for i in range(0, len(result)):
                    final.append(self.get_accumulator())
            for i in range(0, len(result)):
                for key in result[i].keys():
                    final[i][key] += result[i][key]
            readf.close()
        f.close()
        return final

def process(file, processor):
    d = ParallelMmapFilesMaxMemDriver(file)
    return d.process(processor)

Final thoughts

First off: functional languages. Let's use them more. Mauricio Fernandez's OCaml implementation is still the leader, and despite being a little more verbose than the Ruby original is still pretty damn readable (he's actually just announced an even faster one: 25% faster by using block rather than line-oriented I/O; the LOC count is creeping up with each optimisation he does, though). Functional languages require you to think in a different way, but when you're dealing with streams of data, why on earth would you not want to think like this? Better, OCaml gives you imperative and object-oriented language features as well, and seems to pack a solid standard library. I haven't learned a new language for a while, and I'm a little rusty on the functional languages I've been exposed to anyway; I guess OCaml is going to be my next.

Second off: compilers. Let's use them more as well. It's all very well to have an interpreted language where you have no build phase, and can just dive straight into fiddling with code, but I'm really not convinced this is a valid optimisation of the programming process. For one thing, if you're working with tests (and please, work with tests), my experience is that running them will vastly out-weigh any compile time; in particular, if you use an interpreted language, it is likely to completely mask the compilation time. There are enough good, high-level, compiled languages around that compilation doesn't put you on the wrong end of an abstraction trade-off (the way using assembler over C used to before everyone gave up writing assembler by hand).

I have an anecdote that I think further backs up my assertion that a compilation step isn't an impediment to programming. The only time I can remember writing a non-trivial program and having it work first time was with a compiled language. Not because of type safety, not because the compiler caught lots of foolish mistakes. What happened was I was using voice recognition software to control my computer, due to a very serious bout of RSI, so I spent a fair amount of time thinking about the problem before going near the computer. It took about the same amount of time overall as if I'd been able to type, and had started programming straight away—a couple of days total. But this way, I wasn't exploring the problem at the computer, just trying things out in the way I might with an interpreted language, particularly one with an interpreter shell. Programming is primarily about thinking: in many cases doing the thinking up front will take at most as long as sitting down and starting typing. Plus, you can do it at a whiteboard, which means you're standing up and so probably being more healthy, and you can also pretend you're on House, doing differential diagnosis of your problem space.

There are situations where you want to explore, of course; not where you have no idea how to shape your solution, but rather where you aren't sure about the behaviour of the systems you're building on. WF-2 has been mostly like that: much of my time was actually spent writing subtle variants of the parallel driver to test against each other, because I don't understand the effects of the T1 processor, ZFS, and so on, well enough. A lot of the remaining time was spent in debugging line splitting code, due to lack of up-front thought. Neither of which, as far as I'm concerned, is really getting to the point of WF-2. I suspect this is true of most people working on WF-2; you choose your approach, and it comes down to how fast your language goes, how smart you are, and how much time you spend profiling and optimising.

Anyway, you need to have a rapid prototyping environment to explore the systems you're working with, perhaps an interpreted language or one with a shell; and you want a final implementation environment, which may want to use a compiled language. For the best of both worlds, use a language with an interpreter and a native code compiler. (Perhaps unsurprisingly, OCaml has both.)

Where does this leave us with WF-2? None of my original three questions has been answered, although I certainly have learned other things by actually getting my hands dirty. It may be premature to call this, because people are still working on the problem, but I'm not aware of anyone trying a different approach, architecturally speaking, probably because straightforward data decomposition has got so close to the theoretical best case. The optimisations that we've seen work at this scale are helpful—but may not be trivial for the programmer-on-the-street to use, either because language support is sometimes flimsy, or because they'll be building on top of other people's libraries and frameworks. So it's good news that, besides automatic parallelisation of user code, it's possible to parallelise various common algorithms. GCC has parallel versions of various algorithms, there are various parallel libraries for Java, and we can expect other systems to follow. For obvious reasons there's lots of research in this space; check out a list of books on parallel algorithms. In particular check out the dates: this stuff has been around for a long time, waiting for us normal programmers to need it.

Widefinder: Paying Attention

Published at
Thursday 12th June, 2008
Tagged as
  • Scaling
  • Widefinder
  • Parallelism

So far I've found some interesting ways to suck in Widefinder. Let's find some more.

Analysing the lifecycle

I didn't really design the lifecycle of my processes; I was hoping I could just wing it by copying someone else and not think about it too much. Of course, when translating between languages, this isn't a great idea, because things behave differently; and not taking into account the new, enormous, size of the data set was a problem as well. 42 minutes isn't a terrible time, and I suspect that Python will have difficulty matching the 8-10 minutes of the top results, but we can certainly do better.

An obvious thing to do at this point is to look at how long the master, and each worker, actually spends doing the various tasks. What happens is that the master maps the file into memory, then forks the workers, each of which goes over the lines in its part of the file (with a little bit of subtlety to cope with lines that cross the boundaries), then serialises the results back over a pipe using cPickle. The master waits on each worker in turn, adding the partial results into the whole lot; then these results get run over to generate the elite sets for the output.

The following is taken from a run over the 10 million line subset. Other than providing logging to support this display, and removing the final output (but not the calculation), there were no other changes to the code.

Life cycle of master and workers

One thing immediately jumps out: because the workers aren't guaranteed to finish in the order we started them, we're wasting time by having the master process them in order. The gap between the first worker finishing and the first-started worker (which the master insists on handling first) finishing is more than 20 seconds; in the full run we process about 20 times more data, so we'd expect that gap to become much larger. Out of order processing would seem to be worth looking at.
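
As a toy sketch of the difference (nothing to do with the real driver), reaping children with os.wait() hands you whichever worker finishes first, rather than making the slowest early starter a bottleneck:

import os, random, time

pids = []
for i in range(4):
    pid = os.fork()
    if pid == 0:
        time.sleep(random.random())   # simulate uneven worker run times
        os._exit(i)
    pids.append(pid)

for _ in pids:
    # os.wait() returns whichever child exits next, in completion order
    pid, status = os.wait()
    print "reaped worker", pid, "exit code", os.WEXITSTATUS(status)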

It takes about half a second to dump, and another half a second to undump, data between each worker and the master: that's 30 seconds on this run, so again we might expect a significant penalty here. In another language we might try threads instead of processes, but Python isn't great at that. Alex Morega moved to temporary files using pickle, and saw an improvement over using whatever IPC pprocess uses. Temporary files would be worth investigating here; however it seems unlikely I can find a faster way of dumping my data than using cPickle.

Merging takes about 0.2s per worker, which isn't terribly significant. However the client and referrer reports both take significant time to process: nearly six seconds for the client report. It's been pointed out by Mauricio Fernandez that the current elite set implementation is O(nm log m) (choosing the m top of n items), on top of the at-best O(n) merges (the merges depend on the implementation of dictionaries in Python). Mauricio came up with a method he says is O(n log m), which would be worth looking at, although with m=10 as we have here, it won't make much difference.
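
For reference, the standard library already has an O(n log m) selection in heapq.nlargest; a sketch (it orders ties differently from my top() function, so it's illustrative rather than a drop-in replacement):

import heapq

def top(dct, num=10):
    # keeps a heap of the num current leaders instead of repeatedly
    # re-sorting a candidate list: O(n log num) over n keys
    return heapq.nlargest(num, dct, key=dct.__getitem__)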

You can't see it on this chart, but it looks like the line by line processing speed is constant across the run, which is really good news in scaling terms.

Other things to look at

There's the memory problem from before, where I could either run twice as many workers, in two batches (or overlapped, which should be more efficient), or I could make each worker split its workload down into smaller chunks, and memory map and release each one explicitly. The latter is easier to do, and also doesn't carry the impact of more workers.

Since on the T2000 the 32 bit versions of Python run faster than the 64 bit versions, making it so that each mapped chunk is under 2G would allow me to use the faster interpreter. That will actually fall out of the previous point, in this case.

I'm currently using a class I wrote that wraps the Python built-in dict; I should be able to extend it instead, and possibly take advantage of a Python 2.5 feature I was unaware of: __missing__. That's going to affect the processing of every single line, so even if the gains are slight, the total impact should be worthwhile.
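
A sketch of what that might look like (hypothetical class name; collections.defaultdict(int) gives the same behaviour ready-made):

class CountingDict(dict):
    def __missing__(self, key):
        return 0                 # a read of an absent key counts as zero

c = CountingDict()
c['hits'] += 1                   # no KeyError on the first increment
print c['hits'], c['not-seen-yet']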

In WF-1, Fredrik Lundh came up with some useful optimisations for Python. I'm already compiling my regular expressions, but I should roll in the others.

Tim's now installed the latest and greatest Sun compiler suite, so rebuilding Python 2.5 again with that might yield further speed improvements. He's also installed a SPARC-optimised Python 2.5, so perhaps that's already as fast as we're getting.

Other parallelisms

I missed WF 1 in LINQ the first time round, which mentioned PLINQ. The CTP (community technology preview) for this came out earlier this month, and Jon Skeet has been playing with it. PLINQ is exactly what I was talking about at the beginning of this series.

Widefinder: Interesting Ways To Suck

Published at
Tuesday 10th June, 2008
Tagged as
  • Scaling
  • Widefinder
  • Parallelism

After yesterday's fun with pretty graphs, I did some investigation into what the system was actually doing during my runs. I noticed a number of things.

Before doing any of this, I did a run with the full set of data: 2492.9 seconds, or about 41 minutes. I was hoping for something in the range 35-45 minutes, based on a rough approximation, so as a first stab this isn't too bad. It's worth reiterating, though, that this is still an order of magnitude worse than the leading contenders, so the big question is whether this remains just an optimisation problem for me, or if there are other things impeding parallelism as well. Anyway, on to my observations.

The first I've noticed during other people's runs as well. With me, it isn't in disk stalls, because I'm not seeing significant asvc_t against the two drives behind the ZFS volume. However the patterns of asvc_t against drive utilisation (%b in iostat -x) vary for different implementations. This all suggests to me that there are gains to be had by matching the implementation strategy more closely to the behaviour of the I/O system, which isn't really news, but might have more impact than I expected. Of course, for all I know, the best implementations are already doing this; it would be nice to know what about them is better, though, and whether it's applicable beyond this setup, this data, this problem.

Anyway, what do we get from this? That I've failed to observe my first rule in looking at any scaling problem: making sure we don't run out of memory.

Memory

If you use mmap to map a file larger than core, and then read through the entire file, you end up trying to pull the whole thing into core. That can't work, of course, so you get slower and slower until you're swapping like crazy; this of course means you get process stalls, and the ever-helpful operating system starts doing more context switches. So the entire thing melts: it's fun. This is still true even if we split the file across multiple processes; elementary arithmetic should tell us that much, but hey, I went ahead and demonstrated it anyway, even if that's not what I meant to do.

So what I really should do instead is to do multiple runs of J processes at a time; enough runs that the total amount mapped at any one time will always be less than available memory. Another approach might be to munmap chunks as you go through them; I actually prefer that, but I don't think it's possible in Python.

I'm hoping this is why we suddenly go crazy between processing 10m lines and all of them.

Graph of run time against log lines

The slope between 100k and 10m is around seven; between 10m and the whole lot it's about 30. However the slopes lower down are a lot lower than seven, so there's something else going on if they're representative (they may not be, simply because the data sets are too small). If they are, there's a pretty good fit curve, which is very bad news for the scalability of what I've written.

Graph of run time and its slope against log lines

The obvious thing that is changing as we get bigger (beyond running out of memory) is the size of the partial result sets. These have to be serialised, transferred, unserialised, reduced, lost, found, recycled as firelighters, and finally filtered into an elite set. The bigger they are, the longer it takes, of course, but something much worse can go wrong with a map with many items: it can bust the assumptions of the underlying data structure. I don't think that's happening here, as we only have a few million entries in the largest map, but it's something that's worth paying attention to when thinking about scaling. (I have some thoughts related to this at some point, although you could just read Knuth and be done with it. Sorry, read Knuth and about five hundred papers since then and be done with it. Oh, and you have to know what data structures and algorithms are being used by your underlying platform.)

Anyway, today I've mostly been proving how much I suck. I have some other stuff coming up; I've gathered a fair amount of data, and am still doing so, but it's getting more fiddly to either analyse or to pull this stuff into graphs and charts, so that's it for today. Expect something in the next couple of days.

Thoughts on Widefinder

Published at
Monday 9th June, 2008
Tagged as
  • Scaling
  • Widefinder
  • Parallelism

Last year, Tim Bray ran a mini-investigation, based on the idea of parallelising traditionally linear tasks. He wrote a simple web logfile analyser, without any fancy tricks, and watched other people beat its performance in various ways, providing colour commentary along the way. (Should that be color commentary, given that we don't actually have that phrase in British English?) He called it Wide Finder, and the results, although somewhat unscientific because of the constraints he was under, showed the best performance in terms of elapsed time on a multicore T5120 some three orders of magnitude better than Tim's linear implementation in Ruby, with about an order of magnitude more code. The T5120, as Tim pointed out, is the shape of the future, both in the data centre and on the desktop; it doesn't matter who your processor designer of choice is, these things are in scale out rather than scale up mode for at least the next few years.

Now, he wants to do it again, only better: Wide Finder 2 gives people the opportunity to write their own faster version of Tim's linear code in whatever way they want, and to run it on a T2000. Tim is concerned with a balance of complexity and performance; complexity is mainly being measured in LOC, which is probably reasonable to get things going. The crucial idea is that we need techniques for taking advantage of modern multi-core and multi-thread techniques in processors that don't require everyone to be experts in concurrency and multiprocessing. This could prove interesting.

There are three things that I think we should try to shed some light on after WF-1. If you look back over the results, several use memory mapping to reduce the I/O overhead as much as possible, and then have multiple workers go over the space of the file, in chunks, either with OS threads or processes as workers, or something managed by the VM of the language itself. Either you have lots of little chunks, or you have as many chunks as workers, which I'd guess is less strain on the data pre-fetch scheduler in the operating system. Whatever, we're talking about data decomposition with independent data (as in: the processing of each individual log line is independent of other log lines or the results of their processing). This is the easiest kind of data decomposition. So: three things we can investigate from here.

  1. Can we do data decomposition in the independent data case automatically?
  2. For the kinds of problems that we know data decomposition with independent data works well, can we come up with better approaches?
  3. Are there reasonable kinds of problems for which data decomposition is either much too complicated or simply not applicable?

I'll tackle my thoughts on each of them separately, in that order. Most of the stuff I've done is on the first one, as this seems the most interesting to me.

Can we do data decomposition in the independent data case automatically?

There are two ways I can think of for doing this, one of which isn't strictly automatic but is more applicable. This is assuming that all your data comes from the same place to start off with, and that everything's running on one machine, although I'm pretty sure both of those could be lifted with a bit of cleverness (the setup becomes harder, but the user code should remain the same).

Cheating

The first way is kind of cheating. Assume we're dealing with a single loop over the data in the linear case, ie a single reduction. If you write your loop using a line iterator or generator, you can put it all in a function which takes a line generator and some sort of accumulator; a library can then drive your function and take care of making it work in parallel. Let's work through an example in Python to see how this might work. I'm not going to do the WF-2 example like this yet, because it's close to 100 lines long. Let's just calculate the arithmetic mean of a list of integers. This is pretty easy to do in the linear style.

import sys
f = open(sys.argv[1])
n = 0; s = 0
for line in f.readlines():
    try:
        i = int(line)
        s += i
        n += 1
    except:
        # ignore invalid lines
        pass

print (float(s) / n)

So we change things around so we're not driving the loop any more.

import sys, parallel

def processor(lines, driver):
    acc = driver.get_accumulator()
    for line in lines:
        try:
            i = int(line)
            acc.acc('s', i)
            acc.acc('n', 1)
        except:
            pass
    return acc

result = parallel.process(sys.argv[1], processor)
print (float(result['s']) / result['n'])

This is very similar in terms of both plain LOC and complexity, although of course there's stuff hiding in the parallel module. For a linear implementation, that's another 50 odd lines of code; for a fairly simple parallel implementation using mmap and a pipe-fork approach, it's over 100. It's tedious code, and not as efficient as it could be; my aim isn't to build this for real, but to nudge towards an existence proof (really an existence hunch, I guess). I won't bother showing it here for that reason, and because the point is that it's infrastructure: it gets written once so the user code doesn't have to worry about it. Mine is basically Sean O'Rourke's WF-1 winner, translated to Python.

Forking a load of children costs time, moving partial results around and accumulating them takes time, and that can overwhelm the advantages of running on multiple cores when you don't have a large data set; for the simple arithmetic mean, you have to have larger files than I could be bothered with to show an improvement; doing a similar job of counting Unix mbox From_ lines, the parallel version across eight cores was about three times better than the linear version. I haven't bothered trying to optimise at all, so it's likely I'm doing something dumb somewhere.

import sys, parallel

def processor(lines, driver):
    acc = driver.get_accumulator()
    for line in lines:
        try:
            if line[0:5]=='From ':
                acc.acc('n', 1)
        except:
            pass
    return acc

result = parallel.process(sys.argv[1], processor)
print result['n']

Some empirical data, for the Unix mbox job: running on a machine with dual quad-core 1.6GHz Xeons, averaging over 10 consecutive runs from warm (so the roughly 1GB file should already have been in memory somewhere), we get the following. (Note that the GNU grep(1) does it in a little more than two seconds; this is always going to be a toy example, so don't read too much into it.)

Graph of run time against processes

The red line is the mark for my naive linear implementation; its main advantage is that a lot of the work is being done in the interpreter, in C, rather than in my code, in Python. This makes a big difference - the parallel version seems to be doing about 2.5 times as much work. It's worth noting that this machine is not completely unloaded; it's multi-user and also acts as a web and mail server, amongst other things, so beyond J=6 we're seeing slightly more flaky numbers. Back-of-the-envelope monitoring, however, suggests that we don't start getting interference from other processes on the box before we start seeing interference from its own processes, where J=9 and we run out of cores (although the system does a reasonably good job of keeping things going from there up).

Note that there's some interesting discussion going on around WF-2 about how to scale some of these techniques up to gigabytes of input in one go; Alex Morega in particular has been using a broadly similar approach in Python and hit some interesting snags, and I urge you to read his write-up for details. Either we'll come up with new idioms that don't have these problems, or we'll improve the things we depend on (languages, libraries, VMs, operating systems...) to allow us to work better at scale without radically changing the way we code. (I'm sure there are some great approaches to these problems that I haven't heard of - with luck, WF-2 will bring them to a wider group of people.)

Enough on this. Hopefully I've convinced you that this is entirely feasible; it's a matter of someone writing a good library to drive it all.

Not cheating

In order to be more automatic, we need to convert the original linear style into the parallel style entirely programmatically. To my knowledge, you can't do this in Python. In fact, many languages, even dynamic languages, don't want you dicking around directly with live code constructs, which limits the applicability of this idea somewhat. However it's entirely possible to imagine that we could, in some theoretical super-Python, write the following and have the parallel module rewrite the process function appropriately.

import sys, parallel

def process(file):
    f = open(file)
    result = { 'n': 0, 's': 0 }
    for line in f.readlines():
        try:
            i = int(line)
            result['s'] += i
            result['n'] += 1
        except:
            pass
    f.close()
    print (float(result['s']) / result['n'])

parallel.apply(process, sys.argv[1])

In complexity for the programmer, this is like a halfway house between the linear and parallel approaches above, with the nice advantage that it doesn't remotely look parallel except for the last line.

I'm certain it is possible because the only things that enter the loop are either effectively immutable (within the scope of the loop) or the file object f (which we only ever call readlines() on); the only thing that escapes the loop is a dictionary; and within the loop we only ever accumulate into that dictionary. It's not beyond the bounds of programming ability to spot this situation, and convert it into the cheating version above. Although it's probably beyond my programming ability, and certainly is in Python. In fact, we could probably write a parallel.invoke() or something which is given a Python module, and parallelises that, at which point we've got a (limited) automatic parallel Python interpreter. Again, providing you can mutate Python code on the fly.

A question which arises, then, is this: given the constraints of only being able to parallelise loops over quanta of data (for instance, iterating over log lines, or 32 bit words, in a file), with immutable inputs and a number of dictionary outputs, how large is the problem space we can solve? This is actually question three on the original list, so I'll leave it for now.

Another is whether we can lift any of these restrictions, the main one being accumulation. (And I'm counting subtraction as accumulation, for these purposes.) Assuming data independence, there aren't actually many alternatives to accumulation: I can only think of multiplication, which is the same as accumulation anyway, give or take some logarithms. So I'm guessing the answer to this question is an unwelcome "no"; however I'm probably wrong about that. You can do an awful lot with accumulation, though.

For the kinds of problems that we know data decomposition with independent data works well, can we come up with better approaches?

I'll be very disappointed if the answer to this isn't "yes". However I don't think WF-2 is necessarily going to show us much here by being almost tuned to this kind of approach. I'm probably wrong about this as well, though, and new ways of thinking about these kinds of problem would be great.

It's not clear to me, because I don't understand the languages well enough, whether all the techniques that were used in WF-1 with JoCaml and Erlang are covered by data decomposition (beyond things like optimising the matcher). Even if there aren't, there are undoubtedly lessons to be learned from how you structure your code in those languages to approach these problems. This partly falls under the previous question: if we can't automatically parallelise our old ways of doing things, then we want new idioms where we can.

Are there reasonable kinds of problems for which data decomposition is either much too complicated or simply not applicable?

The simple answer is "yes", but that's in general. WF-2 is concerned with what most programmers will have to deal with. Here, I wonder if the answer might be "no". You can do an awful lot without sacrificing data independence. I calculated the arithmetic mean earlier, but for instance you can do standard deviation as well, providing you defer most of the computation until the end, and are prepared to unwind the formula. Generally speaking, we don't seem to actually do much computation in computing, at least these days. I think this means that for most people, beyond this one technique, you don't have to worry about parallelism too much at the moment, because anything that isn't linear to start off with (like, say, web serving) is already scaling pretty well on modern processors. So if we can use data decomposition automatically in some common cases, we dodge the bullet and keep up with Moore's Law for a few more years.
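
As a concrete (and hedged) sketch in the same style as the arithmetic mean example earlier, using the same toy parallel module: standard deviation only needs three accumulated sums per worker, with the formula unwound so that nothing but sums crosses worker boundaries.

import sys, math, parallel

def processor(lines, driver):
    acc = driver.get_accumulator()
    for line in lines:
        try:
            i = int(line)
            acc.acc('n', 1)
            acc.acc('s', i)        # sum of x
            acc.acc('s2', i * i)   # sum of x squared
        except:
            # ignore invalid lines
            pass
    return acc

result = parallel.process(sys.argv[1], processor)
mean = float(result['s']) / result['n']
# population standard deviation: sqrt(E[x^2] - E[x]^2)
print math.sqrt(float(result['s2']) / result['n'] - mean * mean)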

Dependent data decomposition

Data decomposition continues to be useful even with related data; however you start getting into I/O or memory model issues if your data set starts off all jumbled up, as it is in WF-2, because you've got to sort it all to get it into the right worker. For instance, if you want to do logfile analysis and care about sessions, it becomes difficult to do anything without arranging for all the data within a session to go to the same worker. (Not impossible, but in the context of making it easy on programmers, I think we can ignore that case.) In most cases, you're better off filtering your data once into the right buckets, ideally at the point you collect it; you can of course parallelise filtering, but if you're doing that at the same time as your final processing, you're moving lots of data around. The only situation I can think of where it's not going to help to filter in advance is if you need to do different processing runs over the data and your dependencies in the processing runs are different, resulting in a different data decomposition. On the other hand, I can't think of a good example for this. I'm sure they exist, but I'm less sure that they appear to regular programmers.
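
A minimal sketch of that bucketing idea (a hypothetical helper, not taken from any of the implementations): hash the session key, here the client address at the start of each log line, so that all of a session's lines land with the same worker.

def bucket_for(line, workers):
    client = line.split(None, 1)[0]   # first field of a combined-format log line
    return hash(client) % workers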

Note that in my parallel module above, for dealing with an independent data problem, I made a silent assumption that it's cheaper to not move much data around between the parts of your system; so it's far better for the controlling part of the system to tell a worker to attack a given byte range than it is to actually give it that data. This is surely not true in the general case, which is great news for dependent data decomposition. Given suitable VM infrastructure, or an appropriate memory model, this shouldn't actually be a problem; it's just that for most languages on most platforms at the moment, it seems to be. On the other hand, once you scale beyond a single machine, you really want the data you're working on to be close to the processing; a large part of the magic of Hadoop seems to be about this, although I haven't looked closely.

First results for WF-2

First results for WF-2 are beginning to come in. Ray Waldin set the bar using Scala, taking Tim's thousands of minutes down to tens. By now the leaders are running in minutes - note that if the ZFS pool can pull data at 150MB/s, as the Bonnie run showed with a 100G test, then the fastest all the data can come off disk is a little under five minutes; we're seeing results close to that already.

I'll start posting timings from my runs once I get it up to the complete data set; I'm also looking at how efficiently the workers get used as their number varies across different sizes of input file, since Tim has conveniently provided various sizes of live data. So this might take a few days; and there's a chance my times will be embarrassingly bad, meaning I might just not publish :-)

A final point

There's a great book called Patterns For Parallel Programming, by Mattson, Sanders and Massingill (if you're not in Europe you may prefer to get it from Amazon.com). It has a lot more detail and useful thoughts on parallelism than I could ever come up with: although I have some experience with scaling and data processing, these guys do it all the time.

Widefinder: Pretty Graphs

Published at
Monday 9th June, 2008
Tagged as
  • Scaling
  • Widefinder
  • Parallelism

This is about Wide Finder 2; I'm trying to apply the technique I discussed earlier, where we aim for minimal changes in logic compared to the benchmark, and put as much of the parallelism as possible into a library. I'm not the only person working in this fashion; I think Eric Wong's approach is to allow different data processors to use the same system (and it allows multiple different languages, pulling it all together using GNU make), and I'm sure there are others.

I don't have any full results yet, because I ran into problems with the 32 bit Python (and forgot that Solaris 10 comes with a 64 bit one handily hidden away, sigh). However I do have some pretty graphs. These are interesting anyway, so I thought I'd show them: they are timing runs for a spread of worker numbers between 1 and 128, working on 1k, 10k, 100k, 1m and 10m lines of sample data. The largest is about 1.9G, so it still happily fits within memory on the T2000, but this is somewhat irrelevant, because the fastest I'm processing data at the moment is around 20M/s, which is the Bonnie figure for byte-by-byte reads, not for block reads. We should be able to run at block speeds, so we're burning CPU heavily somewhere we don't need to.

Graph of run time against processes

At J=128, all the lines are trending up, so I stopped bothering to go any higher. Beyond the tiny cases, everything does best at J=32, so I'll largely concentrate on that from now on. Update: this is clearly not the right way of approaching it. Firstly, the fact that I'm not using the cores efficiently (shown by hitting neither maximum CPU use per process nor maximum I/O throughput from the disk system) means that I'm CPU bound, not I/O bound, so of course using all the cores will give me better results. Secondly, Sean O'Rourke showed that reading from 24 points in the file rather than 32 performed better, and suggested that fewer still would be an improvement. So I need to deal with the speed problems that are preventing me from actually using the maximum I/O throughput, and then start looking at optimal J. (Which doesn't mean that the rest of this entry is useless; it's just only interesting from a CPU-bound point of view, ie: not Wide Finder.)

You'll see that we seem to be scaling better than linearly. The following graph shows that more clearly.

Graph of M/s against processes

The more we throw at it, the faster we process it. My bet is that ZFS takes a while to really get into the swing of things; the more you fire linear reads at it, the more it expects you to ask for more. At some point we'll stop gaining from that. Of course, that analysis is probably wrong given we shouldn't be I/O-bound at this point.

Graph of run time against data size

(Larger because it's fiddly to read otherwise.) Note that for each J line, from its inflection point upwards it's almost a straight line. It's not perfect, but we're not being hit by significant additional cost. None of this should really be a surprise; processing of a single logline is effectively constant time, with the same time no matter how many workers we have. The things that change as we increase either J or the total number of lines processed are the number and size of the partial result sets that we have to collapse, and other admin; but these differences seem to be being drowned out in the noise.

I'd say there's something I'm doing wrong, or vastly inefficiently, that's stopping us getting better use out of the cores on this machine. That's also more interesting than optimising the actual processing of the loglines.

Finally, the code. No commentary; this is really just Tim's version in Python, but using the parallel driver. I made the final sort stability independent of the underlying map implementation (Hash in Ruby, dict in Python), but that should be it; so far it's given me the same results, modulo the sorting change.

import re, sys, parallel

def top(dct, num=10):
    keys = []
    last = None
    def sorter(k1,k2):
        if k2==None:
            return 1
        diff = cmp(dct[k1], dct[k2])
        if diff==0:
            return cmp(k2,k1)
        else:
            return diff
    for key in dct.keys():
        if sorter(key, last)>0:
            keys.append(key)
            keys.sort(sorter)
            if len(keys)>num:
                keys = keys[1:]
            last = keys[0]
    keys.reverse()
    return keys

hit_re = re.compile(r'^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$')
ref_re = re.compile(r'^\"http://www.tbray.org/ongoing/')

def report(label, hash, shrink = False):
    print "Top %s:" % label
    if shrink:
        fmt = " %9.1fM: %s"
    else:
        fmt = " %10d: %s"
    for key in top(hash):
        if len(key) > 60:
            pkey = key[0:60] + "..."
        else:
            pkey = key
        if shrink:
            print fmt % (hash[key] / 1024.0 / 1024.0, pkey)
        else:
            print fmt % (hash[key], pkey)
    print

def processor(lines, driver):
    u_hits = driver.get_accumulator()
    u_bytes = driver.get_accumulator()
    s404s = driver.get_accumulator()
    clients = driver.get_accumulator()
    refs = driver.get_accumulator()

    def record(client, u, bytes, ref):
        u_bytes.acc(u, bytes)
        if hit_re.search(u):
            u_hits.acc(u, 1)
            clients.acc(client, 1)
            if ref !='"-"' and not ref_re.search(ref):
                refs.acc(ref[1:-1], 1) # lose the quotes

    for line in lines:
        f = line.split()
        if f[5]!='"GET':
            continue
        client, u, status, bytes, ref = f[0], f[6], f[8], f[9], f[10]
        # puts "u, #{u}, s, #{status}, b, #{bytes}, r, #{ref}"
        if status == '200':
            record(client, u, int(bytes), ref)
        elif status == '304':
            record(client, u, 0, ref)
        elif status == '404':
            s404s.acc(u, 1)
    return [u_hits, u_bytes, s404s, clients, refs]

(u_hits, u_bytes, s404s, clients, refs) = parallel.process(sys.argv[1], processor)

print "%i resources, %i 404s, %i clients\n" % (len(u_hits), len(s404s), len(clients))

report('URIs by hit', u_hits)
report('URIs by bytes', u_bytes, True)
report('404s', s404s)
report('client addresses', clients)
report('referrers', refs)

81 LOC, compared to Tim's 78, although that's disingenuous to an extent because of the parallel module, and in particular the code I put in there that's equivalent to Hash#default in Ruby (although that itself is only three lines). Note, however, that you can drive the entire thing linearly by replacing the parallel.process line with about three lines, providing you've got the accumulator class (19 LOC). In complexity, I'd say it's the same as Tim's, though.
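
For the curious, a hedged guess at the accumulator's interface (the real class isn't shown here); the acc() calls in the code above only need something like this Hash#default-ish wrapper:

class Accumulator(dict):
    def acc(self, key, n):
        # accumulate into the dictionary, treating absent keys as zero
        self[key] = self.get(key, 0) + n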

Exporting emails from Outlook

Published at
Friday 6th June, 2008
Tagged as
  • Outlook
  • Email
  • Export
  • WIP

When I left Tangozebra last year, I had various folders of emails that I needed to take with me. I did what seemed to be the sensible thing of exporting them as Outlook .pst files, copied them onto a machine that was going with me, and thought no more about it.

Then, when I needed them, of course, I couldn't open them. I have Outlook 2002 on my machine at home, but these needed Outlook 2007. Fortunately, there's a demo version you can download and play with for 60 days - long enough to get the data off, but not long enough to just keep them all in Outlook. So I was looking for a way of exporting emails. Outlook actually has a way of doing this, although it's not really practical for the thousands of emails I've accumulated over the years that are important; however the export feature isn't in the demo anyway, so it's somewhat moot.

I scrobbled around the internet for a bit, finally chancing upon a tutorial and sample script for exporting data from Outlook using Python. It uses the built-in email-as-text export feature of Outlook, which frankly is pretty unappealing, lacking as it does most of the headers, and in particular useful things like email addresses. Also, their script outputs emails as individual files, which again is unhelpful: I just want an mbox per folder.

So I wrote an Outlook email reaper. It's happily exported about 4G of emails, although it's a long way from perfect. See the page above for more details.

Claiming the evil namespace

Published at
Wednesday 4th June, 2008
Tagged as
  • Presentations
  • Evil
  • Laziness

One of the crazy ideas that occurred at South By Southwest this year was the general application of evil to presentations. Not entirely unlike Battledecks (but practical rather than entertaining), the reason behind the idea is threefold.

It took a bit of time to get up and running, partly because I wanted to be absolutely scrupulous in how I was using other people's images: they must be public, and must be licensed appropriately. However I'm now happy to announce evilpresentation, a simple tool for creating presentations using the power of Flickr, random number generators, web monkeys, and so forth.

In the process of doing this, I of course had to 'claim' a machine tag namespace: evil: is for evil things. Currently, we just have evil:purpose= for the presentation system, but I'm sure someone will come up with some other evil uses in future. Evil is fun.

Mark Norman Francis and Gareth Rushgrove helped come up with the idea, or at least kept on ordering margaritas with me around; I can't remember which (see above, under margaritas).

iPlayer problems

Published at
Monday 26th May, 2008
Tagged as
  • URI design
  • BBC
  • Failure modes

I generally like the BBC's iPlayer; it's not great, but it seems to work. However today I decided I'd watch "Have I Got News For You", based on Paul's accidental involvement. Two little problems.

Firstly, the hostname iplayer.bbc.co.uk doesn't exist. Google has made me expect that this stuff should just work; but that's not a huge problem, because Google itself told me where the thing actually was. However having it on a separate hostname would be a really smart idea, because iPlayer requires Javascript. Using a different hostname plays nicely with the Firefox NoScript plugin, and that just strikes me as a good idea.

The real problem came when I searched on the iPlayer site. Search for "have i got news for you", and you get a couple of results back. Click on the one you want, and you get sent to http://www.bbc.co.uk/iplayer/page/item/b00bdp78.shtml?q=have+i+got+news+for+you&start=1&scope=iplayersearch&go=Find+Programmes&version_pid=b00bdp5g, which was a 404 page. They try to make the error page friendly, but since this is a link generated by their own site, that doesn't help me very much.

So I thought "that's annoying", and was close to firing up a BitTorrent client instead when I wondered if their URI parser was unencoding the stream, and then getting confused because of all the +-encoded spaces. http://www.bbc.co.uk/iplayer/page/item/b00bdp78.shtml?q=yousuck&start=1&scope=iplayersearch&go=Find+Programmes&version_pid=b00bdp5g, for instance, worked perfectly. (Which doesn't help you very much, as iPlayer URIs seem to be session-restricted or something.)

Google, the Fast Follower

Published at
Sunday 18th May, 2008
Tagged as
  • Hack
  • Google
  • Get real, please

Wow! It’s amazing! You can use Google spreadsheets to calculate stuff! Thank heavens we have Matt Cutts and Google App Hacks to teach us stuff that Excel users have been doing for two decades.

Okay, Google: it’s time to wake up now.

URI Posterity

Published at
Monday 7th April, 2008
Tagged as
  • Idiocy
  • Posterity

So I'm having trouble writing a widescreen DVD; I suspect what I actually need to do is upgrade to the all-singing, all-dancing Adobe CS3 Production Premium, which includes Encore and should be able to do everything I want and more. (I don't want much. Honestly.) Before paying lots of money though, I did the "sensible thing" and tried various things I didn't have to pay for, either because they're free or because they're already on my computer.

In the process of doing this, I fired up something that came with one of my DVD writers, probably in the last twelve months. I got the following error box:

Screenshot of error message saying that some unintelligibly long URI cannot be loaded

This is a perfect example of why URI design is important. Had the program been looking for the URI http://liveupdate.cyberlink.com/product/PowerProducer;version=3.2, say, then there's a reasonable chance that URI would have stayed working. That hostname doesn't provide a website (although the root URI returns an HTML document typed as application/octet-stream), just a service. Make the URI easy, and you won't have this problem.

Of course, you should also catch errors. And present them usefully ("I cannot check for updates at this time - perhaps this version is too old and no longer supported?"). But hey, that's experience design, which is nothing to do with the point of this post. (And, it seems, nothing to do with the creation of Cyberlink's PowerProducer product.)

In a telling coda, the URI for the PowerProducer page doesn't really look like it'll last that long either: http://www.cyberlink.com/multi/products/main_3_ENU.html. Sigh.
