#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""Usage: %(prog)s [options] <json-file> <repository-path>
Calculates bug fixing recency and bug fixing frequency for all paths
for which it is non-zero from 'bug_report' information in the
<json-file>, with the help of project history from the given
<repository-path>. Prints the result, currently in JSON format, on
its standard output or to a given file. As a side effect it sorts data
by the bug report timestamp (which should be the same as sorting it by
bug id).
The extra information is added to the [possibly new] 'views' object.
This data belongs neither to 'bug_report' nor to 'commit', as it can
contain data from one or the other, or even both:
* bug fixing recency, section 3.4.1:
f_5(r,s) = (r.month - last(r,s).month + 1)^-1
* bug fixing frequency, section 3.4.2:
f_6(r,s) = |br(r,s)|
where:
* r - a bug report (for which a fix commit exists and is known)
* s - a file path (of a file in the project)
* br(r,s) - set of bug reports, for which file 's' was fixed before
report 'r' was created
* last(r,s) ∈ br(r,s) - most recent previously fixed bug
How br(r,s) is computed, with an inductive algorithm
----------------------------------------------------
Let $B$ be a set of bug reports, and $S(t)$ be a set of all files in
the snapshot of the project at given time $t$ (and $S(c)$ set of all
files at given commit $c$). Let $r \in B$ be a bug report, $r.f$ be a
bugfix commit for a given bug report, and $t_r$ be its creation time.
We will use $changes(c)$ to denote all files changed in commit $c$.
Let's assume that $S(t_r)$ is a good approximation for the version
that the bug report was created for (the version that the author of
the bug report was using when he/she found the bug in question). We
will use $S(r.f)$ as an approximation for this state, as in the
original work.
We can then define $br(r,s)$ in the following way:
\begin{equation}
\label{eq:br-def}
br(r,s) = \{b \in B: t(b.f) < t_r \land s(t_b) \in changes(b.f)\}
\end{equation}
It should be noted that the ordering relation $<_{br}$, defined as
$$a <_{br} b \iff t(a.f) < t_b,$$ does not form a strict total ordering.
If bug $b$ is created while another bug $a$ is still open, then
neither $a <_{br} b$ nor $b <_{br} a$ is true.
The ordering by the bug report date (timestamp), however, is in fact
a strict total order: $$a <_{rep} b \iff t_a < t_b$$.
Let's define $succ(r)$ for $r \in B$ to be the next bug report in the
bug report order after $r$, that is, the oldest bug report that was
created after $r$.
Let $fx(a,b)$, where $a, b \in B$, be the set of bugfix commits such
that $$c \in fx(a,b) \iff t_a \le t(c) \le t_b$$ where $t(c)$ is the
commit timestamp. NOTE that we will (for now) assume that if
$t(a.f) \le t_b$, for bugfix commit $a.f$ and bug reports $a$ and $b$,
then commit $a.f$ could be in $br(b,s)$ for some path $s$.
The inductive algorithm goes as follows:
* For the first bug report $r_0$ in $B$ (in the order of bug report
creation date) the set $br(r_0)$ (defined as $br(r,s)$ for all
paths, that is: $br(r) := \{s \in S(r_0): br(r,s)\}$) is empty;
there are no earlier bug reports, thus no bug reports for which
the bug fixing timestamp (which is always later than the bug report
timestamp: $t_a < t(a.f)$ for each $a \in B$) is earlier than
this bug report.
* For each subsequent bug report, we update br(r-1) to br(r) in the
following way:
1. Find all bug fixes that were not yet taken into account, that is
candidates = fx(r_last, r)
2. If not empty, update r_last
3. For each bug report x in candidates, translate br(*,*)
to t(x.f^), and then include x in br(*,t(x.f^))
4. Translate the final result to t(r.f^)
5. Compute all features dependent on br(r,*)
Here we have used the following definition for br(r;t):
\begin{equation}
\label{eq:br_t-def}
br(r;t) = \{(b,s) \in B \times S(t): t(b.f) < t(r.f)
\land s(b.f^) \in changes(b.f).preimage\}
\cong \bigcup_{s \in S(t)} br(r,s)
\end{equation}
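In code, steps 3 and 4 above boil down to the br_add_bugfix() and
br_translate() helpers defined below; a minimal doctest-style sketch,
with hypothetical paths and commit identifiers:
>>> br = {}                                  # path -> list of fixing commits
>>> br_add_bugfix(br, 'c1', {('a.java', 'a.java'): 'M'})
['a.java']
>>> _ = br_translate(br, {('a.java', 'b.java'): 'R'})  # a.java renamed
>>> br_add_bugfix(br, 'c2', {('b.java', 'b.java'): 'M'})
['b.java']
>>> br == {'b.java': ['c1', 'c2']}           # f_6 for b.java would be 2
True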
The description below is taken from "Mapping Bug Reports to Relevant
Files: A Ranking Model, a Fine-Grained Benchmark, and Feature
Evaluation" (chapter numbers before slash) and preceding work
"Learning to Rank Relevant Files for Bug Reports using Domain
Knowledge" (chapter numbers after slash).
3.5 / 3.4.1 Bug-Fixing Recency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The change history of source code provides information that can help
predict fault-prone files. For example, a source code file that was
fixed very recently is more likely to still contain bugs than a file
that was last fixed a long time in the past, or never fixed.
As in Section 3.2, let 'br(r,s)' be the set of bug reports for
which file 's' was fixed before bug report 'r' was created. Let
'last(r,s) ∈ br(r,s)' be the most recent previously fixed bug.
Also, for any bug report 'r', let 'r.month' denote the month
when the bug report was created. We then define the bug-fixing
recency feature to be the inverse of the distance in
months between 'r' and 'last(r,s)':
f_5(r,s) = (r.month - last(r,s).month + 1)^-1        (8)
Thus, if 's' was last fixed in the same month that 'r' was created,
'f_5(r,s)' is 1. If 's' was last fixed one month before 'r' was
created, 'f_5(r, s)' is 0.5.
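For illustration, f_5 expressed as a plain function of the month
distance (a sketch only; no such helper exists in this module):
>>> def f5(month_dist):
...     return 1.0 / (month_dist + 1)
>>> f5(0), f5(1)
(1.0, 0.5)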
3.5 / 3.4.2 Bug-Fixing Frequency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A source file that has been frequently fixed may be a fault-prone
file. Consequently, we define a bug-fixing frequency
feature as the number of times a source file has been fixed
before the current bug report:
f_6(r,s) = |br(r,s)|        (9)
This feature will be automatically normalized during the
feature scaling step described in Section 3.7 below.
This view is, as far as I understand it, intended to represent the
quality of the code in the given file, based on bug reports and bug fixes.
"""
from __future__ import print_function
import json
import sys
import os
import argparse
from collections import OrderedDict
from datetime import timedelta
from copy import deepcopy
import pandas as pd
from tqdm import tqdm
from git_utils import \
retrieve_changes, retrieve_files, \
retrieve_changes_status
from date_utils import convert_commit_date
import args_utils
from datastore_utils import \
NestedDictStore, select_datastore_backend
from misc_utils import cmp_char
from fix_and_augment import \
fix_commit_metadata, sorted_by_bugreport, sorted_by_commit
from dataset_utils import \
list_of_bugfixes_to_storage
br_format = 'list'
def main():
### run as script
global br_format
# parse arguments
parser = argparse.ArgumentParser(
description="Calculate bug report recency and frequency",
epilog="NOTE: saves stats in the CSV file, if -o/--output provided")
parser = args_utils.add_arguments_and_usage(parser, repo_required=True)
parser = args_utils.add_json_output_options(parser)
parser = args_utils.add_git_backend_selection(parser)
parser.add_argument('--count-files', action='store_true',
help='count files in each bugfix revision')
parser.add_argument('--br', default='list',
choices=['list', 'sparse', 'bitmap', 'none'],
help="storage format for br(r,s)"+
" [default '%(default)s']")
args = parser.parse_args()
# process arguments
datafile_json = args.json_file
repo_path = args.repository_path
indent_arg = args.indent
output_file = args.output or sys.stdout
br_format = args.br
# select git backend
repo = args_utils.repo_by_backend(repo_path, args.git_backend)
# read and process data, preserving order of keys
data = args_utils.read_json(datafile_json, preserve_order=True)
store = select_datastore_backend(args.output.name, category='bug_fixing')
# augment original bug report data, if supported
# switch to supported br(r,s) formats
if isinstance(store, NestedDictStore):
store.update(data)
if br_format != 'list' and br_format != 'none':
print("--br=%s not supported for JSON, switching to 'list'" %
(br_format,), file=sys.stderr)
br_format = 'list'
store, stats = process_data(store, data, repo, count_files=args.count_files)
store.print_data(args)
print('Repository: %s' % (repo_path,),
file=sys.stderr)
process_stats(stats, output_file)
def process_stats(stats, output_file):
"""Print summary of statistics and optionally save them to a file
The data [optionally] saved to a file, in CSV (Comma Separated
Values) format, is the full per-bug statistics; this is done if
`output_file` is not sys.stdout. What is printed is a stats
summary / description of the data in the table.
Parameters
----------
stats : dict
Statistics for each commit / bug report, to be saved as CSV
and/or to generate summary of overall statistics of bug
reports and the repository. Assumed to be generated by
`process_data()`.
output_file : file | sys.stdout
Original file to which JSON output is saved, or sys.stdout if
JSON output is to be printed. If it is not sys.stdout, its
name is used as the name of the CSV file to save full statistics to.
Side effects
------------
Save data to *.stats.csv file if `output_file` is not sys.stdout.
Print summary of data in `stats` to stderr.
"""
df = pd.DataFrame.from_dict(stats, orient='index')
df.sort_values(by='fixes', inplace=True)
if output_file != sys.stdout:
stats_filename, __ = os.path.splitext(output_file.name)
stats_filename += '.stats.csv'
df.to_csv(stats_filename)
print("Saved stats to '%s' (%s)" %
(stats_filename, args_utils.sizeof_fmt(os.path.getsize(stats_filename))),
file=sys.stderr)
df_sum = df.filter(
items=[
# commit stats
'n_changes', 'n_effective',
'n_renames', 'n_additions', 'n_deletions',
'n_changes_eff',
# translation stats
'n_changes_since',
'n_deletions_fixed', 'n_renames_fixed',
# translation to 'r'
'n_del_fixed_curr', 'n_ren_fixed_curr',
# bool-valued stats
'has_prev_in_new',
]
)
# maybe also 'renames_since', 'deletions_since',\
for field in ('renames', 'additions', 'deletions',\
'renames_fixed', 'deletions_fixed',\
'del_fixed_curr', 'ren_fixed_curr'):
df_sum['with_'+field] = df_sum['n_'+field] > 0
print(pd.DataFrame(df_sum.sum(),
columns=['sum']),
file=sys.stderr)
# labels=[...], axis=1 is equivalent to columns=[...], latter from 0.21.0
#
# possibly DataFrame.reorder_levels(order=[...], axis=1) instead
# of using DataFrame.sort_index(axis=1)
print(df.drop(labels=['fixes', 'timestamp'], axis=1)
.sort_index(axis=1)
.describe(percentiles=[.5]).T,
file=sys.stderr)
if output_file != sys.stdout:
stats_filename, __ = os.path.splitext(output_file.name)
stats_filename += '.description.csv'
with open(stats_filename, 'w') as stats_file:
df.drop(labels=['fixes', 'timestamp'], axis=1)\
.sort_index(axis=1)\
.describe(percentiles=[.1, .25, .5, .75, .9])\
.to_csv(stats_file)
print('',file=stats_file)
pd.DataFrame(df_sum.sum(),
columns=['sum'])\
.T\
.to_csv(stats_file)
print("Saved description of stats to '%s' (%s)" %
(stats_filename, args_utils.sizeof_fmt(os.path.getsize(stats_filename))),
file=sys.stderr)
def process_data(store, data, repo, count_files=False):
"""Process bug report dataset, computing bug fixing recency/frequency
The input to this function is the processed bug report dataset, which
includes all the information about bug reports for fixed bugs. It
assumes that data is an associative array (an object), where the
key is the [shortened] SHA-1 identifier of the bugfix commit, and the
data includes a 'bug_report' key.
Parameters
----------
store : datastore_utils.Store
Data structure used to store results.
data : dict
Data about bug reports, read from the JSON file.
This structure is modified by this function, getting sorted
and (depending on 'store') getting data added.
repo : str | git.Repo | pygit2.Repository
Pathname to the repository, or either GitPython (git.Repo)
or pygit2 (pygit2.Repository) repository object.
count_files : bool, optional
Whether to find the number of files in the snapshot for each bugfix
commit (good for statistics). The number of files is taken at the
pre-image of the bugfix commit, that is the state of the project
at the parent of the bugfix commit (where br(r,s) is evaluated).
This data is saved under the 'n_files' key in stats.
Returns
-------
(store, stats) : (datastore_utils.Store, dict)
A tuple consisting of [augmented] data about bug reports, to be
printed or written to a file [e.g. as JSON], and statistics for
each commit / bug report, to be saved as CSV and/or to generate
summary of overall statistics of bug reports and the repository.
"""
stats = {}
# sort data, as the algorithm needs bug_reports sorted by commit
# date with later fixed bug_reports following earlier ones;
# see the definition of br(r,s) in the docstring for this file
print('Fixing and augmenting commit metadata', file=sys.stderr)
fix_commit_metadata(data) # ensures 'timestamp' for commit
print('Sorting data by bug creation timestamp', file=sys.stderr)
data = sorted_by_bugreport(data, key='timestamp')
# if augmenting original bug report data, preserve sorting result
if isinstance(store, NestedDictStore):
store.update(data)
# list of bugfix commits sorted by the commit timestamp that were
# not yet taken into account / used to compute br(r,s)
br_remaining = sorted(data.keys(),
key=lambda c: int(data[c]['commit']['metadata']['timestamp']))
# HDF5 doesn't have true Unicode support; for PyTables (that's what Pandas
# uses) on Python 2.x using unicode strings as names of columns or as index
# leads to PerformanceWarning about pickling; let's convert to str
# (i.e. bytes). NOTE: this may change for Python 3.x
#
# NOTE: SHA-1 hash in hex is always in 'ascii' range and encoding
br_remaining = map(str, br_remaining)
br_new = []
print('Computing br(r,t)', file=sys.stderr)
prev_commit = None
br_data = {} # translated to br_last
br_curr = {} # translated to current commit
br_last = None
# DEBUG/debugging (also possibly for 'bitmap' br(r,s) storage)
commit_idx = { commit: idx for (idx,commit) in enumerate(data) }
# debugging strange patterns in the dataset (printing needs also `commit_idx`)
n_suspicious = 0 # number of times br_commit was not in stats
n_bug_time_collisions = 0 # encountered bug reports with the same date (maybe dup)
n_fix_before_report = 0 # bug reports where the fix is earlier than the report creation
# TODO: make max_warnings configurable and/or use logging
max_warnings = 40 # max number of warnings, < 0 for unlimited, 0 to suppress
for commit in tqdm(data):
# DEBUG
#for (idx, commit) in enumerate(tqdm(data)):
# update/find stats about bugfix commit
commit_full = data[commit]['commit']['metadata']['sha']
fixstatus = retrieve_changes_status(repo, commit_full)
init_bugfix_changes_stats(stats, data, commit,
changes=fixstatus)
create_commit_info_stats(stats, data, commit)
# debugging strange patterns in the dataset
if int(data[commit]['commit']['metadata']['timestamp']) < \
int(data[commit]['bug_report']['timestamp']):
n_fix_before_report += 1
# ensure that there is 'len_br_preceding', even for first bugfix
# which means that there is always 'len_br_preceding' for
# previous bugfix commit (needed to calculate one for current)
stats[commit]['len_br_preceding'] = 0
# if there were no previous bug reports, then br(r,t(r)) is empty
if prev_commit is None:
prev_commit = commit
continue
# find all new commits that may be included in br(r,t(r)), that
# is those that were not yet included in BR(r), and for which
# t(b.f) <= t_r, i.e. those that were fixed before the current
# bug report was created
if len(br_new) > 0:
br_last = br_new[-1]
(br_new, br_remaining) = br_candidates(data, br_remaining, commit)
# DEBUG
# TODO: if __debug__: or logger instead of commenting out code
#print('%d remaining + %d candidates = %d vs %d' %
# (len(br_remaining), len(br_new), len(br_remaining)+len(br_new), len(data)))
#print('br_last=%s commit=%s (%d); %d candidates=%s' %
# (br_last, commit, idx, len(br_new), br_new))
# TODO: move to separate function
stats[commit]['len_br_new'] = len(br_new)
stats[commit]['len_br_preceding'] = \
stats[prev_commit]['len_br_preceding'] + stats[commit]['len_br_new']
# TODO?: this may be done by comparing timestamps instead
stats[commit]['has_prev_in_new'] = \
len(br_new) > 0 and prev_commit in br_new
# we might need to translate br to current commit
if not br_new and not br_data:
# no new bugfixes, no br, nothing to translate
continue
# state for bug report is assumed to be parent of bugfix commit
# NOTE: prev_commit cannot be None, otherwise we wouldn't get here
prev_br_commit = br_last
for br_commit in br_new:
# translate to the next commit, if there is anything to translate
# and if there was a previous commit to translate from
if br_data and prev_br_commit is not None:
prev_br_commit_full = data[prev_br_commit]['commit']['metadata']['sha']
br_commit_full = data[br_commit]['commit']['metadata']['sha']
# files might have been deleted, added or renamed
diffstatus = retrieve_changes_status(repo,
commit=br_commit_full+'^',
prev=prev_br_commit_full+'^')
changes_stats = \
br_translate(br_data, changes=diffstatus)
# update bugfix commit stats
if br_commit not in stats:
# debugging strange patterns in the dataset
n_suspicious += 1
if int(data[commit]['bug_report']['timestamp']) == \
int(data[br_commit]['bug_report']['timestamp']):
n_bug_time_collisions += 1
# kind of DEBUG, suppressible
if max_warnings > 0 and n_suspicious < max_warnings:
print('%s* commit=%s [%d/id=%d] timestamps: bug=%d %s commit=%d' %
(cmp_char(int(data[commit]['bug_report']['timestamp']),
int(data[br_commit]['bug_report']['timestamp'])),
commit, commit_idx[commit],
int(data[commit]['bug_report']['id']),
int(data[commit]['bug_report']['timestamp']),
cmp_char(int(data[commit]['bug_report']['timestamp']),
int(data[commit]['commit']['metadata']['timestamp'])),
int(data[commit]['commit']['metadata']['timestamp'])),
file=sys.stderr)
print('br_commit=%s [%d/id=%d] timestamps: bug=%d %s commit=%d' %
(br_commit, commit_idx[br_commit],
int(data[br_commit]['bug_report']['id']),
int(data[br_commit]['bug_report']['timestamp']),
cmp_char(int(data[br_commit]['bug_report']['timestamp']),
int(data[br_commit]['commit']['metadata']['timestamp'])),
int(data[br_commit]['commit']['metadata']['timestamp'])),
file=sys.stderr)
stats[br_commit] = {}
# actual update
stats[br_commit].update(changes_stats)
# add commit to br
br_commit_full = data[br_commit]['commit']['metadata']['sha']
fixstatus = retrieve_changes_status(repo, br_commit_full)
changes_effective = \
br_add_bugfix(br_data, bugfix=br_commit, changes=fixstatus)
stats[br_commit]['n_changes_eff'] = len(changes_effective)
# update previous
prev_br_commit = br_commit
# translation to current commit
br_curr = deepcopy(br_data)
br_commit_full = data[br_commit]['commit']['metadata']['sha']
diffstatus = retrieve_changes_status(repo,
commit=commit_full+'^',
prev=br_commit_full+'^')
changes_stats = br_translate(br_curr, changes=diffstatus)
stats[commit].update({
'n_ren_to_curr': changes_stats['n_renames_since'],
'n_del_to_curr': changes_stats['n_deletions_since'],
'n_ren_fixed_curr': changes_stats['n_renames_fixed'],
'n_del_fixed_curr': changes_stats['n_deletions_fixed'],
})
# compute and store dependent features
stats_br = \
br_compute_frequency_and_recency(store, br_curr, data, commit)
stats[commit].update(stats_br)
prev_commit = commit
# debugging strange patterns in the dataset
n_total = len(data)
if n_suspicious > 0:
print('%d/%d suspicious bug reports, with br_commit not in stats' %
(n_suspicious, n_total), file=sys.stderr)
if max_warnings > 0:
print('- suppressed warnings after %d' % max_warnings,
file=sys.stderr)
if n_bug_time_collisions > 0:
print('%d/%d occurrences where two bug reports had the same timestamp' %
(n_bug_time_collisions, n_total), file=sys.stderr)
if n_fix_before_report > 0:
print('%d/%d bug reports were fixed before the bug report creation date' %
(n_fix_before_report, n_total), file=sys.stderr)
if count_files:
print('Finding number of files in each bugfix commit', file=sys.stderr)
for commit in tqdm(data):
commit_full = data[commit]['commit']['metadata']['sha']
files = retrieve_files(repo, commit_full+'^')
# TODO?: add this information to data, and save it
stats[commit]['n_files'] = len(files)
stats[commit]['n_files_Java'] = len([f for f in files if f.endswith('.java')])
# TODO: compute mean and/or median length of pathname
else:
prev_commit_full = data[prev_commit]['commit']['metadata']['sha']
print('number of files in repository at last bugfix %s: %d' %
(prev_commit, len(retrieve_files(repo, prev_commit_full))),
file=sys.stderr)
# NOTE: currently data is not getting sorted
return (store, stats)
def br_add_bugfix(fixing_data, bugfix, changes):
"""Update bug fixing recency data with the new bugfix
This function simply adds `bugfix` to the list of bugfixes/commits
for each file that the bugfix modified, that is each file in the
pre-image of `changes`, creating entries if necessary.
Used to extend 'br(r-1;t(r.f^))' with 'r' to 'br(r;t(r.f^))'.
Parameters
----------
fixing_data : dict
Per-file bug fixing info; it should have the following
structure:
{
'<pathname>': [
'<bugfix_commit_1 shortened sha-1>',
'<bugfix_commit_2 shortened sha-1>',
...
],
...
}
bugfix : str
Identifier of the bugfix commit we are (possibly) appending to
'br'; a key to data structure with combined bug report and bug
fixing info.
changes : dict
Result of calling retrieve_changes_status() for bugfix commit;
key is pair of pre-commit and post-commit pathnames, while
the value is one-letter status of changes.
{
(None, '<added file pathname>'): 'A',
('<pathname of file to be deleted>', None): 'D',
('<pathname of modified file>', '<pathname of modified file>'): 'M',
('<pre-change pathname of renamed file>', '<post-change pathname>'): 'R'
}
Returns
-------
list
Effective changes, that is, the list of pathnames for which the
given bugfix changed the bug fixing information.
Side effects
------------
Changes / updates 'fixing_data' parameter
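Examples
--------
A minimal illustration (hypothetical paths and commit identifiers):
>>> br = {'a.java': ['c1']}
>>> changes = {('a.java', 'a.java'): 'M',
...            (None, 'b.java'): 'A',
...            ('gone.java', None): 'D'}
>>> sorted(br_add_bugfix(br, bugfix='c2', changes=changes))
['a.java', 'b.java']
>>> br == {'a.java': ['c1', 'c2'], 'b.java': ['c2']}
True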
"""
changes_effective = []
for (src, dst), status in changes.items():
path = src
if status == 'A':
# create info for newly added files
path = dst
elif status == 'D':
# skip deleted files, those would be handled later
# or in other words deleting file means fixing it
continue
# it may be new file, or first time file was fixed
if path not in fixing_data:
# NOTE: we can use dict instead of OrderedDict here
fixing_data[path] = []
# update information
changes_effective.append(path)
fixing_data[path].append(
# information about bug report / bug fixing commit
# this is the very simplest solution: just identifier
bugfix
)
return changes_effective
def br_translate(fixing_data, changes):
"""Translate bug fixing recency data from one snapshot to the other
This means that entries for files that got deleted during
the transition are removed from the bug fixing recency data, and
entries for files that got renamed have their pathname (their key)
changed accordingly.
In other words, translate 'br(r,before)' to 'br(r,after)',
where changes = diff(before, after)
Parameters
----------
fixing_data : dict
Per-file bug fixing info; it should have the following
structure:
{
'<pathname>': [
'<bugfix_commit_1 shortened sha-1>',
'<bugfix_commit_2 shortened sha-1>',
...
],
...
}
changes : dict
Changes from 'before' commit/snapshot to 'after' commit/snapshot.
Result of retrieve_changes_status(repo, before, after);
key is pair of pre-commit and post-commit pathnames, while
the value is one-letter status of changes.
Returns
-------
dict
Statistics summary for the change
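Examples
--------
A minimal illustration (hypothetical paths and commit identifiers):
>>> br = {'old.java': ['c1'], 'dead.java': ['c2']}
>>> stats = br_translate(br, {('old.java', 'new.java'): 'R',
...                           ('dead.java', None): 'D'})
>>> br == {'new.java': ['c1']}
True
>>> (stats['n_renames_fixed'], stats['n_deletions_fixed'])
(1, 1)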
"""
stats = {
'n_renames_since': 0,
'n_deletions_since': 0,
'n_renames_fixed': 0,
'n_deletions_fixed': 0,
}
#
# remove information about files which got deleted
for path in [srcdst[0] # the src pathname
for srcdst, status in changes.items()
if status == 'D']:
stats['n_deletions_since'] += 1
# the deleted file might not have been in fix commit
if path in fixing_data:
del fixing_data[path]
stats['n_deletions_fixed'] += 1
# update information about renamed and copied files
for src, dst in [srcdst # (src_path, dst_path) pair
for srcdst, status in changes.items()
if status in 'RC']:
stats['n_renames_since'] += 1
# renamed file might not have been in bugfix commit
if src not in fixing_data:
continue
if changes[(src,dst)] == 'R':
# move data to the new name, i.e. change the key
# NOTE: this trick works for regular dict only
fixing_data[dst] = fixing_data.pop(src)
stats['n_renames_fixed'] += 1
else: # it is 'C'
# copy data to new name
fixing_data[dst] = deepcopy(fixing_data[src])
# information about new files added only as necessary
return stats
def br_candidates(data, br_remaining, commit):
"""Find candidates not yet included to be added to br(r,*)
Given the list of remaining (that is, not yet taken into account)
bug fixes, split this list into the part that was created (fixed)
before the creation time of the given bug report, and the part that
was created (fixed) later.
Creation times of bugfix commits, and the date when the given bug
report was created, are taken from the augmented combined bugs+fixes
dataset. The list of remaining fixes (as shortened SHA-1
identifiers, which are keys in the dataset to bug report + bug fix
info) needs to be sorted in ascending chronological order of
bugfix commit creation date. Returned lists are also sorted; the
original list is split in two.
Parameters
----------
data : dict | collections.OrderedDict
Combined data about bug reports and bugfix commits, read from
the JSON file.
br_remaining : list
List of remaining keys to data (shortened SHA-1
identifiers of bugfix commits), sorted in the bugfix commit
creation time order. This means that the `commit` timestamp
divides this list into two parts: the first with commit
creation date not later than the creation date of the given
bug report, and the second with those that are later.
/-- t(r)
|
[c_0, c_1,...,c_i, | c_{i+1},...,c_{N-1}]
where t(c_j) < t(c_{j+1}) and t(c_i) < t(r) <= t(c_{i+1}).
commit : str
Identifier of the bug report, all bugfix commits added to the
returned list have commit date not later than bug report
creation date.
TODO?: maybe change the name of this parameter.
Returns
-------
(br_new, br_remaining) : (list, list)
First list in returned pair is the list of bugfix commits from
`br_remaining` with creation time earlier than bug report
creation time of `commit`. Because `br_remaining` is assumed
to be sorted this would be some number of elements from the
start of it, and it would also be sorted. Possibly empty.
Second list in the returned pair is the list of remaining bugfix
commits, with creation time not earlier than the creation time of
the given bug report. Possibly empty.
These two lists are (br_remaining[:i], br_remaining[i:]) for
some value of i.
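Examples
--------
A minimal illustration (hypothetical identifiers and timestamps);
'c1' was fixed before bug report 'r' was created, 'c2' after:
>>> data = {
...     'r':  {'bug_report': {'timestamp': '150'},
...            'commit': {'metadata': {'timestamp': '200'}}},
...     'c1': {'bug_report': {'timestamp': '50'},
...            'commit': {'metadata': {'timestamp': '100'}}},
...     'c2': {'bug_report': {'timestamp': '160'},
...            'commit': {'metadata': {'timestamp': '180'}}},
... }
>>> br_candidates(data, ['c1', 'c2'], 'r')
(['c1'], ['c2'])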
"""
this_bug_ts = int(data[commit]['bug_report']['timestamp'])
this_fix_ts = int(data[commit]['commit']['metadata']['timestamp'])
commit_list = []
# DEBUG
#print('commit =%s (bug_ts=%d) / bug_id=%d' %
# (commit, this_bug_ts, int(data[commit]['bug_report']['bug_id'])))
# corner cases
if not br_remaining:
# DEBUG
#print('br_candidates: empty list')
# no candidates
return ([], [])
elif this_bug_ts <= int(data[br_remaining[0]]['commit']['metadata']['timestamp']):
# DEBUG
#print('br_candidates: early return %d < %d' %
# (this_bug_ts, int(data[br_remaining[0]]['commit']['metadata']['timestamp'])))
# all commits are later (newer) than given bug
return ([], br_remaining)
elif int(data[br_remaining[-1]]['commit']['metadata']['timestamp']) < this_bug_ts:
# even last commit is earlier (older) than given bug
# NOTE: should never happen in this code
return (br_remaining, [])
for (i,v) in enumerate(br_remaining):
curr_bug_ts = int(data[v]['bug_report']['timestamp'])
curr_fix_ts = int(data[v]['commit']['metadata']['timestamp'])
if not curr_fix_ts < this_bug_ts:
return (br_remaining[:i], br_remaining[i:])
def br_compute_frequency_and_recency(store, bug_fixing_data, data, commit):
"""Compute bug-fixing frequency and bug-fixing recency features
Compute bug fixing recency:
f_5(r,s) = (r.month - last(r,s).month + 1)^-1
and bug fixing frequency:
f_6(r,s) = |br(r,s)|
given br(r,s) data for some bug report / bugfix commit. Bug
report creation date, and the time it was fixed (bugfix commit
creation date), are taken from the dataset.
These computed features, and also some data used to calculate
them, are saved in the given data store.
Return statistics about those two features.
TODO: fix order of parameters, rename some of them.
Parameters
----------
store : datastore_utils.Store
Data structure used to store results.
bug_fixing_data : dict
Per-file bug fixing info, used to compute derived features
data : dict
Data about bug reports, read from the JSON file.
commit : str
Identifier of the bugfix commit, relevant for computing the
recency value, and the key under which to store results
Side effects
------------
Appends data to store
Returns
-------
dict
Statistics about bug frequency, bug recency and related stuff.
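Examples
--------
The recency value uses a 30-day month approximation, as in the code
below; for a fix 45 days before the bug report it gives 1/2:
>>> from datetime import timedelta
>>> 1.0/((timedelta(seconds=45*24*3600).days//30) + 1)
0.5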
"""
global br_format
stats = {
'files_with_br': 0,
'max_frequency': 0,
}
this_bug_ts = int(data[commit]['bug_report']['timestamp'])
this_fix_ts = int(data[commit]['commit']['metadata']['timestamp'])
for path in bug_fixing_data:
frequency = len(bug_fixing_data[path])
if frequency == 0:
# should not happen
continue
# TODO: better selection, e.g. dict of functions
if br_format == 'list':
store.set_feature(commit, path, 'br', bug_fixing_data[path])
elif br_format == 'bitmap':
store.set_feature(commit, path, 'br',
list_of_bugfixes_to_storage(data,
bug_fixing_data[path]))
elif br_format != 'none':
store.set_br(data, commit, path, 'br', bug_fixing_data[path])
store.set_feature(commit, path, 'frequency', frequency)
stats['files_with_br'] += 1
stats['max_frequency'] = max(stats['max_frequency'],
frequency)
lastbr = bug_fixing_data[path][-1] # TODO: not exactly
last_bug_ts = int(data[lastbr]['bug_report']['timestamp'])
last_fix_ts = int(data[lastbr]['commit']['metadata']['timestamp'])
delta_ts = this_bug_ts - last_fix_ts
# DEBUG (on error)
#if delta_ts < 0:
# print("path='%s'" % (path,))
# print('br.last=%s (tc=%d) curr=%s (tr=%d) delta[s]=%d' %
# (lastbr, last_fix_ts, commit, this_bug_ts, delta_ts))
# sys.exit(0)
store.set_feature(commit, path, 'recency_timedelta[s]', delta_ts)
store.set_feature(commit, path, 'recency[30-day months]',
1.0/((timedelta(seconds=delta_ts).days//30) + 1))
# TODO: use relativedelta or rrule from dateutil
# see https://stackoverflow.com/questions/4039879/best-way-to-find-the-months-between-two-dates
return stats
def init_bugfix_changes_stats(stats, data, commit, changes):
"""Initialize stats related to changes in the bugfix commit
Those statistics currently include:
- n_changes: total number of all changed files
- n_effective: all changes except deletions
- n_renames: all file renames
- n_deletions: all file deletions
- n_additions: all file additions (new files in the commit)
Parameters
----------
stats : dict
Statistics for each commit / bug report, to be saved as CSV
and/or to generate summary of overall statistics of bug
reports and the repository.
data : dict
Combined data about bug reports and bugfix commits, read from
the JSON file.
commit : str
Identifier of the bugfix commit, key to 'stats' structure
to create and/or update.
changes : dict
Result of calling retrieve_changes_status() for a commit;
key is pair of pre-commit and post-commit pathnames, while
the value is one-letter status of changes.
Returns
-------
dict
Updated statistics data
Side effects
------------
Modifies contents of 'stats' parameter
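Examples
--------
A minimal illustration (hypothetical identifiers):
>>> stats = {}
>>> changes = {(None, 'b.java'): 'A', ('a.java', 'a.java'): 'M'}
>>> stats = init_bugfix_changes_stats(stats, {}, 'c1', changes)
>>> (stats['c1']['n_changes'], stats['c1']['n_additions'],
...  stats['c1']['n_deletions'])
(2, 1, 0)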
"""
if commit not in stats:
stats[commit] = {}
stats[commit].update({
'n_changes': len(changes),
'n_effective': len([pair for pair, status in changes.items()
if status != 'D']),
'n_renames': len([pair for pair, status in changes.items()
if status == 'R']),
'n_deletions': len([pair for pair, status in changes.items()
if status == 'D']),
'n_additions': len([pair for pair, status in changes.items()
if status == 'A']),
})
return stats
def create_commit_info_stats(stats, data, commit):
"""Create statistics info related to bug and bugfix commit itself
Those statistics and info currently include:
- fixes: bug report id of bug fixed by given commit
- timestamp: bugfix commit creation time
Parameters
----------
stats : dict
Statistics for each commit / bug report, to be saved as CSV
and/or to generate summary of overall statistics of bug
reports and the repository.
This parameter is modified by this function.
data : dict
Combined data about bug reports and bugfix commits, read from
the JSON file.
commit : str
Identifier of the bugfix commit, key to 'stats' structure
to create and/or update.
Returns
-------
dict
Modified and augmented stats data.
"""
if commit not in stats:
stats[commit] = {}
stats[commit].update({
'fixes': int(data[commit]['bug_report']['bug_id']),
'timestamp': float(data[commit]['commit']['metadata']['timestamp']),