Sunday, November 29, 2009

Those lying numbers

There is more than one way to contribute to open source projects. Unfortunately, eclipse dash only shows one aspect of it, the code; and even for this one aspect it does not capture the reality of things. Why is this? Dash counts CVS commits, and this misses the key aspect of who authored the code.

So why am I getting to this today? Because people are using dash as the scarecrow on diversity, but I believe it does not represent the reality of each project. Here is the case study of p2:

1% for Cloudsmith? 1% for EclipseSource? WTF? This does not look very diverse... Unfortunately these numbers show exactly what I want to demonstrate: they are bogus. They do not represent the reality of the investment made by those two companies or the number of patches received from individuals. Indeed, Thomas H., Henrik L. and Ian B. have all been regular contributors to the project and know a lot of the code base. In fact I'm sure that if IBM were to pull the plug on the project it would carry on just fine (probably with even more freedom since I would be gone). Their companies have products based on p2 (I believe this to be a sign of commitment for the size of p2), they come to every call, and are not afraid of taking on big issues, etc...

So why are the numbers so low?

  • Patches committed for others. I have been committing a lot of patches either from the community or on behalf of Thomas H. Unfortunately this again inflates the IBM numbers to the detriment of Cloudsmith or "individuals".
  • Lately the code has been very much in flux because of a large refactoring (package rename, etc.), which inflates the commit count and dilutes others' commits.
  • Number of IBM committers. IBM has more committers than others on the project, thus allowing for more code to be produced. However, if those companies were to increase their number of participants (wink, wink) to a number equal to that of IBM, they would then be on par. Maybe we should compare the companies based on the average number of commits per committer (e.g. commitCount / committer).
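
To make the last bullet concrete, here is a minimal sketch of the "average commits per committer" comparison. It is not how dash computes anything; the function name, the input shape (one `(company, committer)` pair per commit) and the sample data are all made up for illustration.

```python
from collections import defaultdict


def commits_per_committer(commits):
    """Return {company: commitCount / number of distinct committers}.

    `commits` is an iterable of (company, committer) pairs, one pair per
    commit. This input shape is an assumption made for this sketch.
    """
    commit_count = defaultdict(int)   # company -> total commits
    committers = defaultdict(set)     # company -> distinct committer ids
    for company, committer in commits:
        commit_count[company] += 1
        committers[company].add(committer)
    return {
        company: commit_count[company] / len(committers[company])
        for company in commit_count
    }


# Hypothetical data: CompanyA has two committers, CompanyB has one.
sample = [
    ("CompanyA", "a1"), ("CompanyA", "a2"),
    ("CompanyA", "a1"), ("CompanyA", "a2"),
    ("CompanyB", "b1"), ("CompanyB", "b1"),
]
print(commits_per_committer(sample))
```

In this made-up sample CompanyA has twice the raw commit count of CompanyB, yet both average two commits per committer, which is the kind of difference the raw percentages hide.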

I'm sure that I'm missing other factors about why those numbers are so low, but you get the point... Though I recognize that almost every project could use a little more diversity, we have to be careful about how numbers are being used. If we want to use dash as a reliable hint on activity and diversity, then we should revise how the numbers are computed to take into account: patch author instead of committer, activity in bugs, activity on mailing lists, number of people asking questions in forums, etc...


David Carver said...

Pascal, I agree the numbers in DASH only tell a part of the story. But here is the thing, why not bring those more active contributors on as committers? For some reason people still think that being a committer requires a full time 8 hr a day commitment to the project, when this is not the case. Unfortunately, I don't think we do a good enough job making this clear.

Also, as you said, you can very easily skew the numbers; my own commit count is very skewed. However, the number of active committers on a project will greatly skew the diversity of a project as well. If IBM pulls committers on P2, and moves them elsewhere...yes, P2 will continue...but I would say you still have lost a huge knowledge base that will take longer than you may think to replenish.

Project diversity should be used as a guide and not a final indicator, but it is a guide that can be used to help indicate if a project has a problem or not.

Le ScaL said...

At this point, only Thomas H is not a committer and he is on track to be one.
What I was pointing out happens not only with external people but also with current committers. Sometimes the person who commits a patch is the one who knows the code best. This is how we work.

As for giving commit rights, I think we are in violent agreement, but we still want people who understand enough of the rationale of the code, and finally it takes convincing all the other committers to change the current mindset. Also, we have been experimenting with the incubator, where commit rights are given almost instantly, but so far this has only had mixed success.

Ismael Juma said...

Git can help a bit as it has separate author and committer fields. However, even that is not enough for many cases, so the Linux kernel uses things like Signed-off-by, Acked-by, Cc, Tested-by, Reviewed-by (and maybe others) to better track the people who were involved in a given patch (and in what capacity).
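
As a rough illustration of that idea, the sketch below tallies credit per role from commit metadata. The function name, the `(author, committer, message)` input shape and the sample commit are assumptions for this example; in a real Git repository the author and committer names could be read with the `%an` and `%cn` placeholders of `git log --format`.

```python
import re
from collections import Counter

# Trailer lines of the form "Key: person", limited to the keys named above.
TRAILER = re.compile(
    r"^(Signed-off-by|Acked-by|Tested-by|Reviewed-by|Cc):\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def tally_credit(commits):
    """Count involvement per (role, person).

    `commits` is an iterable of (author, committer, message) tuples;
    this shape is an assumption made for this sketch.
    """
    credit = Counter()
    for author, committer, message in commits:
        credit[("author", author)] += 1
        credit[("committer", committer)] += 1
        # Each trailer line credits one more person in one more role.
        for key, who in TRAILER.findall(message):
            credit[(key.lower(), who.strip())] += 1
    return credit


# Hypothetical commit: Alice wrote the patch, Bob committed it.
sample = [
    ("Alice", "Bob", "Fix bug\n\nSigned-off-by: Alice\nReviewed-by: Carol\n"),
]
for (role, person), count in sorted(tally_credit(sample).items()):
    print(role, person, count)
```

Counting by the "author" role rather than the "committer" role is what would give a patch written by one person and committed by another back to the person who wrote it.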