Towards Base Rates in Software Analytics

Early results and challenges from studying Ohloh

Magiel Bruntink (Uni. van Amsterdam)
Benevol 2013

I agree with Harald and Gregorio, so:
Find my data, code and replication details (Wiki) on github.com/OhlohAnalytics

Why should we care about base rates in software analytics?
“Our research on 300 projects shows that our method predicts project failure with 70% precision.”

Projects that are predicted to fail (200)
Projects that fail (140)

Let’s say the method says “your project will fail.”
What is the chance of failure?

70% is not the right answer!
This is the base rate fallacy.

You have to take into account the base rate:
“How many projects fail in the first place?”

Projects that are predicted to fail (?)
Projects that fail (?)

All projects (?)

Projects that fail out of 1000: 200 (example base rate)
Projects that do not fail but test positive: (1000 - 200) * (100% - 70%) = 240
Project failure chance given positive test: 200 / (200 + 240) = 45% or less (the 200 assumes that every failing project tests positive)

Projects that are predicted to fail (440)
Projects that fail (200)

All projects (1000)
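
The arithmetic in this example can be reproduced in a few lines. The sketch below is an illustration, not code from the talk: the function name and its parameters are made up here. It follows the slide in reading the 70% as the share of non-failing projects that the method classifies correctly, and by default it assumes that every failing project is flagged, which is the best case behind the "or less".

```python
def failure_chance_given_positive(total, failing, true_negative_rate, sensitivity=1.0):
    """Chance that a project flagged as failing really fails.

    total:              number of projects in the population
    failing:            number of projects that actually fail (the base rate)
    true_negative_rate: share of non-failing projects classified correctly
    sensitivity:        share of failing projects that get flagged (assumed 1.0)
    """
    true_positives = failing * sensitivity
    false_positives = (total - failing) * (1.0 - true_negative_rate)
    return true_positives / (true_positives + false_positives)


# Numbers from the slide: 1000 projects, 200 of which fail.
slide_example = failure_chance_given_positive(1000, 200, 0.70)
print(f"{slide_example:.0%}")  # 45%

# Same method, but a lower (hypothetical) base rate of 50 failures in 1000.
rare_failures = failure_chance_given_positive(1000, 50, 0.70)
print(f"{rare_failures:.0%}")  # 15%
```

The second call shows why the base rate matters: the method itself is unchanged, yet a positive prediction means much less when failures are rare.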

What is/are software analytics?
Software analytics is a research area focussing on:
–  Data on software artefacts (code, documents) and projects (people, activities)
–  Appropriate statistics and data-driven methods
–  Actionable insight to users, developers, decision makers, etc.

Example of software analytics in practice:
–  Software Improvement Group

 

Don’t we know these things?
It turns out, there is work in this area:

–  “Survival analysis on the duration of open source projects”, Samoladas et al., 2010
–  “Reclassifying Success and Tragedy in FLOSS Projects”, Wiggins and Crowston, 2010
–  “A statistical examination of the properties and evolution of libre software”, Herraiz Tabernero, 2008
–  “Software Assessments, Benchmarks and Best Practices”, Capers Jones, 2007
–  …

But it can, and should, be extended further:

–  Data sets are small, or focussed on 1-10 languages (most research)
–  Or focussed on only open source (most research)
–  Or data sets are not available at all (Capers Jones)

How to start obtaining base rates?
Through large-scale data collection and careful analysis

Leveraging the ocean of data now available on open source software (600 K projects on Ohloh)

Researching the application of analyses from other fields such as medicine or economics

Gathering some data
Work in progress: collecting and analysing data from Ohloh http://www.ohloh.net/

Data set: monthly record of history for 12,360 open source projects
–  Code/comments/blanks added, deleted, total
–  Number of contributors
–  Number of commits
–  Main programming language
–  Other meta data

Architecture
[Diagram: components exportProjectFacts, obtainProjectFacts, processProjectFacts, validateData (1st phase), projectFactsRepository, Repository, Cache (caching values), Logging, Sleep and Output; connections labelled http, xml and csv.]

700 lines in Rascal files in total, 700 lines of R, 18 lines of Java

Data set by 10 most popular main programming languages on Ohloh
[Figure: bar chart of the number of projects (axis 0 to 2000) per main programming language: shell script, Ruby, Perl, C#, JavaScript, PHP, Python, C++, C, Java. Annotation: “Other” category.]

Early results: Project inactivity
Question: What is the rate of OSS projects becoming inactive?

Metric: Probability of Continued Activity
–  Measured by a Kaplan-Meier estimate based on (right-censored) inactivity events. A project is considered to suffer from inactivity if it has 0 commits in a year (of age).
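
To make the metric concrete, here is a minimal pure-Python sketch of a Kaplan-Meier estimate. It is not the analysis code from the talk (that is written in Rascal and R); the function name, the input layout and the toy data are assumptions for illustration. Each project contributes the age in years at which it became inactive, or the age at which it was last observed if it was still active (right-censored).

```python
from collections import Counter


def kaplan_meier(ages, became_inactive):
    """Return [(age, probability of continued activity)] at each event age.

    ages[i]:            age in years at which project i became inactive,
                        or at which it was last observed if still active
    became_inactive[i]: True if inactivity was observed, False if censored
    """
    events = Counter(a for a, e in zip(ages, became_inactive) if e)
    leaving = Counter(ages)  # events and censorings both leave the risk set
    at_risk = len(ages)
    survival = 1.0
    curve = []
    for age in sorted(leaving):
        inactive_now = events.get(age, 0)
        if inactive_now:
            # Product-limit step: multiply by the share that stays active.
            survival *= 1.0 - inactive_now / at_risk
            curve.append((age, survival))
        at_risk -= leaving[age]
    return curve


# Toy data: six projects, three of them still active when last observed.
ages = [1, 2, 2, 3, 4, 5]
became_inactive = [True, True, False, True, False, False]
curve = kaplan_meier(ages, became_inactive)
for age, probability in curve:
    print(age, round(probability, 3))
```

Censored projects never lower the curve themselves; they only shrink the group at risk for later ages, which is what lets projects that are still active contribute to the estimate.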

Early results: Project inactivity
[Figure: Kaplan-Meier estimate for 10811 projects. x-axis: Age of project in years (0 to 30); y-axis: Probability of Continued Activity (0.00 to 1.00). Annotations: median “survival” is 8.5 years; older projects “die” slower.]

Early results: Code growth
Question: What is the yearly code growth rate of OSS projects?

Metric: Indexed Code Growth
–  An index of the code size at the end of a year compared to the beginning of the year. For example, a value of 1.05 represents 5% code growth since the beginning of the year.
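
As a small worked example of this index, the sketch below divides the code size at the end of a year by the size at its start. The function name and the line counts are hypothetical, and refusing a non-positive starting size is a choice made here, not something the slides specify.

```python
def indexed_code_growth(loc_start_of_year, loc_end_of_year):
    """Code size at the end of a year relative to the size at its start."""
    if loc_start_of_year <= 0:
        # The index is undefined without a positive starting size.
        raise ValueError("starting code size must be positive")
    return loc_end_of_year / loc_start_of_year


# A hypothetical project that grows from 20,000 to 21,000 lines in a year.
growth_index = indexed_code_growth(20_000, 21_000)
print(f"{growth_index:.2f}")  # 1.05, i.e. 5% growth
```

An index below 1 means the code base shrank over the year.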

Early results: Code growth
[Figure: Yearly Code Growth Index (axis 1.00 to 2.00) for the 10 most used main programming languages in the data set (JavaScript, Python, PHP, C, C++, Ruby, shell script, Java, Perl, C#), by years of age (1, 2, 3). Annotations: note the number of outliers; PHP: build once, “never” change?; mainstream languages have similar growth patterns “on average”.]

Challenge: data quality
Initial investigation into the Ohloh data reveals that at least 15% of the cases are suspect or wrong
–  Inconsistencies (LOC do not always add up?)
–  Implausibilities (LOC negative?)
–  Source repositories badly configured (SVN)
–  Missing data
–  Events where code is moved or imported

How to deal with this problem for big software datasets?

–  Looking into machine learning tools (e.g., Gaussian processes) to automate detection of issues
–  Build up a manually-verified sample and compare
–  Cross-validate with other datasets
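
The first two kinds of issue listed above (inconsistencies and implausibilities) can be caught with simple rule-based checks, before any machine learning is involved. The sketch below is such a check, not the validation code from the project: the record layout and field names are assumptions loosely modelled on the "added, deleted, total" fields of the data set.

```python
from dataclasses import dataclass


@dataclass
class MonthlyFact:
    """One monthly record for a project (hypothetical field names)."""
    month: str
    code_added: int
    code_deleted: int
    code_total: int


def find_suspect_months(facts):
    """Return (month, reason) pairs for records that fail basic sanity rules."""
    issues = []
    previous_total = None
    for fact in facts:
        if min(fact.code_added, fact.code_deleted, fact.code_total) < 0:
            issues.append((fact.month, "implausible: negative LOC"))
        elif previous_total is not None:
            expected = previous_total + fact.code_added - fact.code_deleted
            if expected != fact.code_total:
                issues.append((fact.month, "inconsistent: LOC do not add up"))
        previous_total = fact.code_total
    return issues


history = [
    MonthlyFact("2012-01", 1000, 0, 1000),
    MonthlyFact("2012-02", 200, 50, 1150),
    MonthlyFact("2012-03", 100, 0, 1300),  # 1150 + 100 - 0 is 1250, not 1300
    MonthlyFact("2012-04", 10, 20, -5),    # a negative total
]
issues = find_suspect_months(history)
for month, reason in issues:
    print(month, reason)
```

Rules like these only flag records; deciding whether a flagged month is a genuine import or move of code, or a broken repository configuration, still needs the manual verification or cross-validation mentioned above.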

Challenge: beyond open source
Base rates should not be limited to only open source software

How to obtain sufficient industrial software data? This is a call to industry to share data.

Question: Are industrial software and open source software (projects) really different?

Conclusion
Base rates are needed to avoid fallacies

Open source data enables doing this research

Challenges ahead, and help of industry is needed!

Thanks! Questions?
Find the data, code and replication details on github.com/OhlohAnalytics

Magiel Bruntink
University of Amsterdam
[email protected]
020 525 8201