Sunday, November 29, 2009

Parsing statistics

The BBC has published Wikipedia's rebuttal to statistics about a dropoff in editors.  If one uses different criteria and defines "editor" at a minimum of 5 edits a month instead of 1 edit, then the departure rate drops from 49,000 to 4,900 and the actual inflow/outflow of editors is at stasis.

Since those numbers differ by an order of magnitude, I'm curious to see a more in-depth approach.  What percentage of posts by 1-4 edit editors were reverted as vandalism?  And how does that compare against those who edit more frequently?

Possibilities to explore:
  • Good faith newcomers are getting driven off by aggressive reversion?
  • Breaching experiments are on the decline?

I have a hunch that reversion is somewhat more vigilant toward editors who haven't registered, or who have redlinked userspaces.  Also have a hunch that there is a large but limited pool of people who conduct breaching experiments without actually intending to contribute.  In the latter scenario, most of the people whose main intention is to test the edit function have already written "hi there", gotten reverted, and left.

It might be possible to measure those behaviors in relative terms.  One approach would involve tracking how many edits the average new account makes before the user talk page and userpage get created, and comparing the reversion rates of edits from accounts that are fully bluelinked, partially bluelinked (talk page but no user page), and fully redlinked.  Another approach would be to parse the rate of short-history editors whose posts include the words "hi", "hello", or "test".
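Here is a rough sketch of how both measurements could be computed once the edit data is in hand.  Everything about the data is an assumption for illustration: the record layout, the status labels, and the sample rows are made up, and the cutoff of four edits simply follows the "1-4 edit" group mentioned earlier.

```python
import re
from collections import defaultdict

# Hypothetical records, invented for illustration:
# (link_status, edit_text, was_reverted, author_edit_count)
SAMPLE_EDITS = [
    ("fully_bluelinked", "Added citation for 1917 poster", False, 120),
    ("talk_only", "hi there", True, 1),
    ("fully_redlinked", "test", True, 1),
    ("fully_redlinked", "Fixed a typo in the lead", False, 3),
    ("fully_redlinked", "hello!!", True, 2),
    ("talk_only", "Corrected the birth year", False, 4),
]

# Whole-word match, so "this" or "birth" does not count as "hi".
GREETING = re.compile(r"\b(hi|hello|test)\b", re.IGNORECASE)


def reversion_rates(edits):
    """Return the fraction of reverted edits for each link status."""
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for status, _text, was_reverted, _count in edits:
        totals[status] += 1
        if was_reverted:
            reverted[status] += 1
    return {status: reverted[status] / totals[status] for status in totals}


def greeting_share(edits, max_edits=4):
    """Share of edits by short-history editors that contain a test word."""
    short = [e for e in edits if e[3] <= max_edits]
    if not short:
        return 0.0
    hits = sum(1 for e in short if GREETING.search(e[1]))
    return hits / len(short)


if __name__ == "__main__":
    for status, rate in sorted(reversion_rates(SAMPLE_EDITS).items()):
        print(f"{status}: {rate:.0%} reverted")
    print(f"greeting/test share: {greeting_share(SAMPLE_EDITS):.0%}")
```

The numbers only mean something in relative terms: the interesting comparison is between the three link-status groups, not the absolute rate for any one of them.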

Saturday, November 28, 2009

Open Progress



Expressing special thanks today to Gerard Meijssen and the Open Progress Foundation for outstanding work opening global access to digitized material from great cultural institutions in The Netherlands.  This week 35,000 images were uploaded to Wikimedia Commons and, thanks to the cooperative relations he has developed, Wikimedian volunteers can obtain high resolution versions of selected highlights upon request for restoration purposes.

Several months ago I expressed gratitude for these efforts by restoring an 1890s photochrom print of the Amsterdam Centraal railway station.  Today it's Wikipedia's picture of the day being highlighted on the site's main page.

Congratulations, Gerard.  This is for you.  Couldn't have happened to a nicer person.

Monday, November 23, 2009

Giving thanks


Most of the active Wikimedians are aware that the English language Wikipedia is largest in terms of total articles and the German Wikipedia is second, but it comes as a surprise that the second largest featured picture program is not German but Turkish.

The English language Wikipedia currently has 2,111 featured pictures.  German has 747 featured pictures.  Yet the Turkish Wikipedia has 976 featured pictures.  That sets the Turkish language site in second place if one counts by Wikipedia editions.  To set this in perspective, the Turkish Wikipedia is nineteenth in overall size with 138,000 articles.  The German Wikipedia is on the verge of crossing the one million article milestone.

This means the Turkish Wikipedia has roughly one featured picture for every 140 articles, or about 0.7%.  Since one featured picture can illustrate multiple articles, the share of articles that carry one may be higher still.  Either way that comes out as a far higher ratio than at any other Wikipedia of substantial size.
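For anyone who wants to check the arithmetic, here is a quick sketch using the figures quoted above.  The German article count is an assumption: the post only says the site is on the verge of one million, so the sketch rounds to that.

```python
# Featured picture counts as quoted in this post.
FEATURED = {"English": 2111, "German": 747, "Turkish": 976}

# Article counts. Turkish is quoted in the post; German is an
# assumed round figure ("on the verge of" one million).
ARTICLES = {"German": 1_000_000, "Turkish": 138_000}


def featured_per_article(edition):
    """Featured pictures divided by total articles for one edition."""
    return FEATURED[edition] / ARTICLES[edition]


if __name__ == "__main__":
    for edition in ARTICLES:
        ratio = featured_per_article(edition)
        print(f"{edition}: {ratio:.3%} featured pictures per article")
```

No English figure is computed because the post doesn't quote an article count for that edition.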

What's really interesting is how one medium sized Wikipedia developed a featured picture program that's about 30% larger than the German program.  A core of perhaps half a dozen Turkish editors have been scouting other language projects' featured picture programs, translating the captions into Turkish, and adding the images to articles on their Wikipedia.  Featured picture content at the other projects is different enough from one site to another that the Turkish editors could amass three or four thousand featured pictures if they just keep doing what they're already doing.

Of course I hope they also knock on the doors of museums in their home country. Also very curious about whether this generates synergies with text edits and improvements at their Wikipedia.

Two things happen universally when people from other parts of the world see restored featured pictures about their own culture:
  1. They're delighted.
  2. They want to share information.

Now and then I do restorations about Turkish history and culture, in gratitude and to give their project an extra boost.  The most recent of these is the 1917 Ottoman heliograph crew at Huj pictured above.  Its featured candidacy is underway at the Turkish Wikipedia.  Darned if I understand the discussion other than the succession of green light icons.  Last week the Turkish Wikipedia passed Wikimedia Commons as the project where I have the second most featured content credits.

The Turkish editors have created a model that can be emulated.  And I'm very interested in trying a pilot project with another Wikipedia to see whether a caption translation and featured picture drive would provide a shot in the arm in terms of editor participation and article growth.

One place where I'm proposing a pilot program is the Irish language Wikipedia.  They already have a small featured picture program (two dozen images, a quarter of which I restored) and I know one of their administrators (hi there Alison) and have done a few restorations specific to her country and culture.  A few days ago President Kennedy came up in conversation and I asked whether she knew en:wiki has a featured picture of his brother.

There's Robert F. Kennedy at a CORE rally in 1963: the Attorney General of the United States speaking from the steps of the Justice Department in favor of racial equality.

Very cool.

Right now the Irish Wikipedia is the ninety-second largest Wikipedia with 9,274 total articles.  If their editor community is willing I'd like to help them emulate the Turkish featured picture program.

Although I can't speak a word of either language, if a picture is worth a thousand words we can hold a conversation.

Sunday, November 22, 2009

Letter perfect


Thumbnail previews and reduced size views don't always reveal how much work a restoration is going to require.  This World War I era poster is in very good condition.  At the web-optimized reduction for this blog post it hardly seems to need any restoration at all.  A closer look at the full size file, though, shows that this won't be a walk in the park.

It's a mostly good image with several tricky problems including a crease that runs vertically through the Statue of Liberty's hand and torch.  Today let's look at the ink smears.  It isn't unusual to encounter smeared lettering on historic posters.  Often the source of the problem is water damage.  In this instance it only affects the line that was printed in black ink.  Most of the caption is gray and doesn't have this problem, but the entire line of black lettering has ink smudges.

In case you're wondering, this poster translates to say, "Food will win the war - You came here seeking freedom, now you must help to preserve it - Wheat is needed for the allies - waste nothing."  It was printed in 1917.

Here's the most heavily damaged word at full resolution.  I actually perform the restoration at twice that resolution, but this is enough to convey what the work will be.  The basic idea is to trace the outline of each letter and substitute undamaged paper texture in place of the ink smears.  Two factors will make the difference between a mediocre repair and a good one:
  • Paper texture in historic images is not created equal.  Slight differences will occur in brightness, color balance, and roughness.  So the source area has to be chosen with an eye for those subtle distinctions or else the result will look patchy.
  • The letters have to look like they exist naturally within the cloned area.  This means the cloning has to mimic the aliasing that occurs in undamaged regions.

My first passes worked mostly with large areas and a tool setting of 12 to 15 pixels in diameter.  The source for this cloning is a part of the poster that looks a bit rougher and more textured than the rest.  Our goal here is not to create something that's digitally perfect, but something that fits seamlessly with the surrounding image.  The aim is to mimic good printing for 1917.

The narrowest parts between letters have to be done at a tool diameter of five or six pixels.  Unfortunately this is a situation where Photoshop has a big advantage over the current version of GIMP.  The Photoshop clone stamp tool has a sliding option that allows the user to select any percentage hardness.

Hardness affects how much a cloned area blends with surrounding data.  One hundred percent hardness looks like the cloned area was cut with scissors and pasted in.  Zero percent hardness is really soft and smudgy.  Smudgy is what we're trying to get away from, so we do want some hardness here.  But we don't want the text to seem like a ransom note or an old punk rock poster.  So what's needed is something in between.  A static setting is going to lose its subtlety as this work progresses from wide spaces to narrow gaps between letters.  I do most clone stamping at thirty-five percent hardness.  Getting down to half a dozen pixels, though, it helps to be able to drop that to twenty percent.
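For the programmers in the audience, here is a minimal sketch of what a hardness setting could mean in code.  It is an illustration only: the linear falloff and the function names are my assumptions, not how Photoshop or GIMP actually compute their brushes.  The idea is a round brush that is fully opaque inside a hard core and fades to nothing at its edge, with hardness setting how big the core is.

```python
def brush_alpha(distance, radius, hardness):
    """Opacity (0.0 to 1.0) of a round brush at `distance` pixels from
    its center. `hardness` runs from 0.0 (soft) to 1.0 (hard edge).
    Assumed model: opaque core of size hardness * radius, then a
    linear fade to zero at the brush edge."""
    if distance >= radius:
        return 0.0
    core = hardness * radius
    if distance <= core:
        return 1.0
    return (radius - distance) / (radius - core)


def blend(source, destination, alpha):
    """Mix a cloned source value into the destination pixel value."""
    return alpha * source + (1.0 - alpha) * destination


if __name__ == "__main__":
    radius = 6  # a 12 pixel diameter tool
    for hardness in (1.0, 0.35, 0.2, 0.0):
        row = [round(brush_alpha(d, radius, hardness), 2) for d in range(7)]
        print(f"hardness {hardness:.0%}: {row}")
```

At one hundred percent the opacity stays at full strength right up to the edge and then drops to zero, which is the scissors look.  At zero percent it starts fading from the very center, which is the smudgy look.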



This is where GIMP users get caught between a rock and a hard place.  Hardness in GIMP is a simple on/off toggle.  If there's a plugin to make that more nuanced I'd love to know about it because the GIMP editors who work with me have real trouble with this sort of challenge.  They can get an acceptable result with the default program if they work hard enough, but it takes them several times as long as it takes me in Photoshop.  GIMP is open source, so if you happen to be a motivated programmer who likes to see this work spread free culture you could do something to help solve this problem.

I fixed the smudges on this row of text in about two hours.  If discussions with good GIMP editors are accurate, multiply that by a factor of three to five for them to get a result of comparable quality.

The full version of the completed restoration can be viewed here.  Below is a glimpse of it.


Saturday, November 21, 2009

Warm welcomes

 
More good news: another editor has gotten into image restoration and is doing really good work.  The image above is a historic isothermal chart from 1823.  Jujutacular ran it as his first restoration about a week ago.  His first effort was impressive and it was a treat to see that he had gotten it from the New York Public Library website.  The pool of resources for source material is broadening.

The version Jujutacular ran as his first nomination is better than my early work.

That represents quite a lot of cleanup.  His effort really shows on the upper margin and far left.  I was on the fence about the nomination--didn't want to rain on his parade yet thought the restoration could go even farther.  Took a chance and offered to collaborate.  It turns out he's a really good sport, eager to learn, and a joy to work with.

He had a version saved without the histogram adjustment.  Smart fellow!  We traded off on additional dirt and smudge removal; both of us applied masks to correct the uneven brightness.  I added a perspective crop, patched in a margin at the lower right edge, and did the final tweaks with curves and color balance.  Sometimes it's magical when the final work feels like a time machine.  This was one of those occasions.

There's nothing quite like the moment of enjoying an editor's reaction for the first time when he realizes, "I did this."  So cheers to Jujutacular.  Looking forward to seeing his next project.

----
Per request, adding links to the full versions: