Chapter 3
SOURCE CODE METRICS
3.1
The form of the
programmes
A
programme gets many forms before being run.
In the conceptual phase of the
specifications, the programme takes the form of the variable description, of
the processing sequences, results and operating mode.
The programming techniques
develop specific ways of algorithm and data structure description, aiming to
find as rigorous and as unambiguous ways as possible.
Specification making methods are
presented in [ ].
If the ideal solution has been
found, the Varnier techniques, nested logic, top-down would have had a
constantly decreasing contribution, down to their elimination. One can define
metrics on specifications. However, the variety of the description methods,
their inaccuracy and diversity reduce the result comparison possibility.
Working on the source code could
be replaced with an analysis on the pseudo-code form.
The pseudo-code is an
intermediate algorithm description language. There are no standards for such a
language, which leads to local presentations and the lack of comparability.
The
following issues justify working on the source code:
¨
There are languages for which standards are defined;
¨
The languages have compilers meeting these
standards;
¨
The programming discipline do not allow
misunderstandings;
¨
There is a very large number of users;
¨
The wish for portability of the programme systems
dictates the use of the features that can be found in all language
implementations.
Due to this
situation, working on the source code is not altered by subjective
fluctuations, and should the indicators be applied to programmes that are
currently used, this means that all the terms set by a user are met.
Language diversity
and the large number of programmes categories require different approaches. In
the first stage a programmes base in C++, another one in PASCAL, another one in
COBOL, and another one in FORTRAN programming languages are made.
Each programmes base is very
homogeneous relatively to the programming language. The ability to compare
programmes requires for the group elements to be homogeneous. The exogenous
variables must find their corresponding levels in each element.
The programmes are transformed
from one language into another by defining transforming coefficients. The quality
of the respective proportions makes this transformation accurate.
The analysis of the source code
is a way of measuring. However, it is not the only one. It shows the advantage
of the possibility to write software for analysis processes automating [Balog].
3.2
The length of the
programme measured in the number of instructions
Defining
a programming style brings new rules regarding the use of instructions and the
expressions making.
The result is programmers being
disciplined so that, if a problem is let known to a group of independently
working programmers, the result will be a multitude of programmes which do not
differ significantly from one another (from what the use, data types, control
variables, executable instructions are concerned).
The programming style emphasises
the programmers’ experience. The best sequences are selected. A general scheme
will be defined in order to create a file, and only the fields will differ from
one application to another.
Should the programming style be
developed and assimilated, the number of source lines expresses the length of
the programme.
For example, the next rules are
set for the programmes written in the C language:
·
The comments are not taken into account when
calculating;
·
The accolades describing the blocks take a row each;
·
The assignment expressions take a line each;
·
The multiple assignment expressions take a line;
·
Each instruction defined with key words takes a
line;
·
The function heading takes a line;
·
The control variables have a one-letter name;
·
The increasing and decreasing operations are done
with unary operators;
·
Line up on levels the instructions belonging to the
same block;
When applying these rules,
the procedure computing the product of two matrixes is:
void matprod (int a[M][N], int
b[N][K], int c[M][K], int m, int n, int k)
{
int i,j,p,cij;
for (i = 0; i
< m, i++)
for (j = 0; j
< k; j++)
{
cij
= 0;
for
(p = 0; p < n; p++)
cij+=
a[i,p] * b[p,j];
c[i][j]
= cij;
}
}
The function choosing the
minimum element out of three elements will be:
float minim (float a, float b,
float c)
{
float min;
min = a;
if(min > b)
min = b;
if(min > c)
min = c;
return (min);
}
The
sequence inputting the students’ community describing data will be:
struc student
{
char name[20];
int age;
float absavg;
}
stud [20];
int i,n;
printf (“/n The numbet of students in the class:”);
scanf (“%d”, n);
for (i = 0; i
< n; i++)
{
printf (“/n The name of the student:”);
gets (stud.nume
[i]);
printf
(“/n Age:”);
scanf
(“%d”,&stud.age [i]));
printf
(“/n Graduation degree:”);
scanf
(“%5.2f”,&stud.absavg [i]));
}
One can notice the homogeneity
of the sequences, due to which the programmers can correctly understand the
programme, even at the first reading. It is these rules that make the average
number of elements (operands and operators) on a line not to differ too much
from one programme to another. Systematically defected programmes are not taken
into account. We include here the programmes containing instructions repeating
for no reason.
For example:
int a[3] [3];
a[0] [0] = 0;
a[0] [1] = 0;
.
.
.
a[2] [2] = 0,
when the initialisation can be made as follows:
for (i = 0; i
< 3; i++)
for (j = 0; j
< 3; j++)
a[i]
[j] = 0;
Should the initialisation values
be random, one can either make the initialisation at definition time, or input
the data from the keyboard. The
repeated assignment of values is not an option.
So the length of the programme
measured as the number of source code lines needs a minimum set of rules to be
applied for efficiency purposes.
Using the programme NRLIN, which
starts by normalizing the source code, one can get the indicator of the number
of source code lines.
“To normalize” is to perform a
process that applies transformations to a programme written in a certain
language, in order to meet the rules dictated by the programming style.
For example, the programme for
choosing the minimum and maximum elements and their positions in an array is
given as follows:
maxpoz =minpos = 0;
max = min = x[0];
for (i = 1; i < n; i++) {
if (min > x[i] { min = x[i]; minpos = i;}
if (max < x[i] { max = x[i]; maxpos = i;}
The normalized programme is:
max = min = x[0];
maxpos = minpos = 0;
for (i = 1; i < n; i++)
{
if (min >
x[i])
{
min
= x[i];
minpos
= i;
}
if (max <
x[i])
{
max
= x[i];
maxpos
= i;
}
}
The normalisation rules are no
longer local when the NRLIN programme is used before the C/C++ programme base
is made.
Although the programmers aim to
make compact programmes by writing many instructions on a line or increasing
the complexity of expressions, normalizing is necessary for comparison
purposes.
For
example, the function choosing the minimum out of three integers is written in
one of the following forms:
A) int minim (int a, int b, int c)
{
int min;
min = a;
if (min > b)
min = b;
if (min > c)
min = c;
return (min);
}
B) int minim (int a, int b, int c)
{
int min;
if (a > b)
if
(b > c) min = c;
else
min
= b;
else
if
(a > c) min = c;
else
min
= a;
return
(min);
}
C) int minim (int a, int b, int c)
{
int a1;
a1=((a) < (b)) ? (a): (b);
return (((a1) < (c)) ? (a1)
: (c))
}
If the versions A, B, C of the
programme are saved in files, the source code lines analysis leads to the
following results:
|
Programme |
The
number of un-normalized source code lines |
The
number of normalized source code lines |
|
A |
6 |
8 |
|
B |
8 |
13 |
|
C |
4 |
6 |
Chart 3.1
If the normalized version has
the same number of lines like the un-normalized one, then the capacity level is
at its maximum.
The
programmes kit takes into account observations regarding the sequences, without
making a compromise between the number of instructions and the errors
generation in the case of very compact programmes.
The simple sequences, those with
a larger number of source code lines are preferred to the compact, but with the
possibility of error generation ones.
3.3.
The Length of a
Programme Seen as a File
If
the programming style is very strict (the case of a training centre of a
certain software production unit), then the length of the programme seen as a
file is a reasonable measuring unit. It requires:
¨
Defining the heading of a function like a comment,
but always on 12 source code lines;
¨
Writing a three line comment every time an important
step of the algorithm is taken;
¨
Defining standard sequences for usual processing;
¨
Using programme sequences libraries (with macro
definitions);
¨
Dictating an inferior limit level of re-using.
If a new problem is
given to a group of programmers, more than 80 per cent of the suggested
solutions are identical. The teams of programmers are very homogeneous. The
length of the file is expressed in bytes. Its homogeneity is expressed by:
·
The blanks percentage (all of them write the
instructions on the line identically);
·
The intra-sequence and inter-sequence comments
percentage;
·
Key-words percentage;
·
Operators and operands percentage;
These
percentage structures do not differ significantly from one file to another, in
the programmes base of a certain language.
The
functions:
a) Scalar product of two arrays of vectors:
int scalprod (int x[ ], int y[
], int n)
{
int s, i;
/* the n
elements of the array are multiplied */
for
(s = 0, i = 0; i(n; i++)
s+
= x[i] * y[i];
return (s);
}
b)
Addition of two matrixes
void matadd (int a[ ][N],int b[
][N], int c[ ][N], int n, int m)
{
int j, i;
/* the m lines
of the matrixes a and b are gone over */
for (i = 0; i<m; i++)
/* the elements
on the line i of the matrixes a and b are added */
for (j = 0; j<n; j++)
c[i][j]
= a[i][j] + b[i][j];
}
c) Choice
of minimum out of a matrix
int matmin (int a[ ][N], int n ,
int m)
{
int j, i, min;
min = a[0][0];
/* the m lines
of the matrix a are gone over */
for (i = 0; i<m; i++)
/* the line i is
searched for the minimum element */
for (j = 0; j<n; j++);
if(min>a[i][j])min
= a[i][j];
return (min);
}
d)
Function for generation of the an term of the Fibonacci array (without
recursion) is: ak =
ak-1 + ak-2, k = 2,3...
a0 = 0; a1 =1
void fibonacci (int n)
{
int i, ai,
aimin1, aimin2;
aimin2 = 0;
aimin1 = 1;
/* Testing the
value of n*/
if (n == 0 || n
== 1)printf(“/n n must be > =2 “);
else
/* Computing ak
= ak -1 + ak -2; k =2,3….*/
for (i = 2;i
< =n; i++)
{
ai
= aimin1 + 1imin2;
printf(“/n
a%d = %d”, i, ai);
aimin1
= ai;
aimin2 = aimin1;
}
}
One can notice the strictness of
applying the rules dictated by a programming style in making the four
functions. The results are the data in
chart 3.2:
|
Function (Programme) |
The
length of the file (bytes) |
The
number of source code lines |
The
number of bytes per instruction |
The
number of words per source code line |
|
0 |
1 |
2 |
3 |
4 |
|
A |
176 |
8 |
176/8 |
45/8 |
|
B |
272 |
9 |
272/9 |
66/9 |
|
C |
295 |
11 |
295/11 |
69/11 |
|
D |
376 |
17 |
376/17 |
87/17 |
Chart 3.2
By word we mean:
·
Key words;
·
Variable names;
·
Separators;
·
Operators.
For example:
For the function scalprod there
are:
No. of key
words…. 7
No. of
operators..... ..
++ * ( ) ] + = < 21
No.
of separators },{, comma and ; ................................. 11
Variable
names.................................................................... 6
__________________
Total 45
One
can notice that, for these functions the relative values from columns 3 and 4
of the chart 3.2 differ significantly, so the length of the file can be got by
applying a proportion coefficient to the number of source code lines.
The statistic analysis of the
columns 3 and 4 can emphasise the stability / instability of the programming
style, so it can also show how representative the indicator of the file length
is.
A function is attached to the
programme NRLIN in order to get the length of the file and to obtain the chart
3.2 indicators.
3.4
The Halstead Metric
Any programming language is
defined by:
¨
Un-running (declarative) instructions definition;
¨
Executable instructions;
¨
Handling the operators and operands within
expressions;
¨
The programmes are made up of instructions, written
in sequences, without taking into account the running order;
¨
The programmes are considered homogeneous, without
taking into account the made processing.
Although
data types require assigning different memory areas, all the operands can be
considered homogeneous, and counted globally. Though operators involve
different numbers of machine cycles, they will be considered homogeneous and
counted globally.
Although the key words
corresponding to executable instructions have different length compiling
sequences, they will be considered homogeneous, and counted globally.
These simplifying hypotheses are
specific to the Halstead metric.
Let’s consider:
n1 –
the number of distinct operands;
n2 –
the number of distinct operators;
The following relations are
defined:
·
n = n1
+ n2, where n is the vocabulary used in the programme;
·
N = N1 +N2, where:
N1
– the total number of operands in the programme;
N2 – the total number of apparitions
of the operators in the programme;
N
– the observed length of the programme.
·
C = n1 log2 n1 +
n2 log2 n2, where C is the evaluated
complexity of the programme;
·
V = N log2 n, where V is the volume of
the programme;
·
K =
, where K is the transparency of the programme;
·
L =
, where L is the implementation level;
·
D =
, where D is the difficulty.
The programme PPP.CPP automates
the calculation of these indicators for C++ programmes.
One can notice that the Halstead
metric considers a programme to be an ensemble of parts, without taking into
account the interactions and the running mode of the instructions.
For example, the sequences:
|
s = 0; i = 0; x[ i ] =
1; for ( i =
1; i < n; i ++) { x [
i ] = ( i + 3 ) * ( i + 3 ); s+ =
x [ i ]; } |
|
s = 0; for ( i = 0; i < n; i ++) { y [ i ] =
0; z [ i ] = i; x [
i ] = ( i + 3 ) * ( i + 3 ); s+ =
x [ i ]; } |
|
Sequence
a) |
|
Sequence
b) |
Although
different in meaning, if taken into account the running and the results, when
they are applied to the programme PPP.CPP lead to the results in the chart 3.3.
|
Indicator |
Sequence
a) |
Sequence
b) |
|
n1 |
|
|
|
n2 |
|
|
|
N1 |
|
|
|
N2 |
|
|
|
V |
|
|
|
C |
|
|
|
D |
|
|
Chart 3.3
Although the instructions are
not commutative, the Halstead metric does not show commutativness, it is for this particular reason that it is applied
to accurate and optimal programmes in order to made representative comparisons.
For example, the sequences:
|
s = 0 a = 0; for ( i = 0; i < n; i ++) { s+ =
x [ i ]; } |
|
s = 0 for ( i = 0; i < n; i ++) { a =
0; s+ =
x [ i ]; } |
|
Sequence
a) |
|
Sequence
b) |
Although identical from what the Halstead
metric is concerned, the non-optimality of the b) sequences not emphasized.
3.5 Diversity Indicators
The Halstead metric identifies n1
– the number of distinct operands and n2 – the number of
distinct operators. When presenting the languages, slight meaning differences
emerge even within the group of operands. Considering this, detailing the
groups is a must, which means that a transition from the two groups to a larger
number of groups reflecting these slight differences takes place.
Let’s consider:
fi –
the number of i elements present in the programme;
m – the number
of types (m = 2 corresponds to the Halstead metric);
The relation:
S =![]()
Measures
up the diversity (Tovissi & Taylor).
In [Ivan, Popescu, Nica] for the
C programming language, m = 7, where:
f1 -
f2
-
f3
-
f4
-
f5
-
f6
-
f7
-
According to this let S be an m
variable function.
x – array of m
components ( x1, x2, ......., xm );
S [x] =
;
With xi >0
; xi Î (0, K ]
Does function S
[x] allow a maximum value for x1 = x2 = ... xm
=?
In order
to make sure that the programmes are compatible, normalization is a must. A
method of normalization is:
s [x] = 
Another method is:
s [x] = 
For the sequences:
|
S = 0; p =0; for (i
= 0; i< n; i++) s+ =
x[i]; for (i
= 0; i< n; i++) p+ = y[i]; |
|
s = 0; p =0; for (i
= 0; i< n; i++) { s+ =
x[i]; p+ =
y[i]; } |
|
Sequence
a) |
|
Sequence
b) |
The programmes
PPP and PP1 implementing this definition will lead to the results given in
chart 3.4.
The
compared analysis of the source code metrics
|
Indicator |
Halstead
metric |
Diversity
metric |
|
C |
|
|
|
V |
|
|
|
|
|
|
Chart 3.4
3.6 Statistic analysis of the
source code
Variables are defined and
numbers of using (in expressions, parameter lists) in the source code are
counted.
The degree of using is defined
as a ratio between the number of apparitions of a variable and the total number
of apparitions of all the variables. This way the definitions of the intense
used or not used variables are emphasised. a,
which is a relative measure of the deviation from the average will show even
programming lacks that have not been brought forward by the compilers.
a
=
, Where:
K – The number of variables
defined in the programme;
- The average value of the number of the variables use in the
programme;
- The frequency of the variable i in the programme.
a2 + a3 ; c > 0 a2 + a4 ; c = 0 a3 + a4 ; c < 0
For example,
the functions for the evaluation of the expression
f =
are:
a) int
calcul1 (int a, int c)
{
int f;
if (c > 0)
f=
a*a + a*a*a;
if
(c = 0)
f=
a*a + a*a*a*a;
if
(c < 0)
f
= a*a*a +a *a*a*a;
return
(f);
}
b) int
calcul2 (int a, int c)
{
int
f, b, d, e;
b
= a*a;
d
= b*a;
e
= d*a;
if
( c > 0)
f
= b + d;
if
(c = 0)
f
= b+e;
if
(c < 0)
f
= e + d;
return (f);
}
c) int
calcul3 (int a, int c)
{
int
f, b;
b
=a*a;
if
( c > 0)
f
= b + b*a;
else
if
(c = 0)
f
= b+b*b;
else
f
= a*b+b*b;
return (f);
}
The chart
3.5 of the variable utilization frequencies is made. These frequencies regard
only the executable instructions in a function.
|
Indicators |
Appearance
frequencies |
||
|
|
Calculation
1 |
Calculation
2 |
Calculation
3 |
|
c |
3 |
3 |
3 |
|
a |
18 |
4 |
4 |
|
f |
4 |
4 |
4 |
|
b |
|
4 |
9 |
|
d |
|
4 |
|
|
e |
|
4 |
|
|
The
total of appearances |
|
3 |
|
|
Average
appearances |
25 |
22 |
20 |
|
a |
25/3 |
22/6 |
20/4 |
Chart 3.5
The result is an operands use
intensity difference. However, version calcul3 ensures equilibrium between the
number of operands and the intensity of their use.
All the types of indicators of
the applied statistic can be defined on to the programmes regarded as groups of
instructions (each having is position in the programme).
3.7. The Volume of the
Executable Instructions
Like the case of the Hallstead
metric, the hypothesis that the instructions in the programme are homogeneous
is made, although there are big differences in meaning between the instruction
a = 0 and the instruction if(a > b) s+= x[i].
As we consider the executable
instructions to be identical, the operations volume first comes up as expressed
by the number of run (executed) instructions.
For the following sequence:
a = 0;
b = 2;
C = m*n +x [i];
The
volume V = 3, as the sequence contains three instructions, each of them run
once.
For the sequence:
p = 0;
s = 0;
for (i = 0; i
< n; i++ )
{
s+ = x[i];
p + = x[i]*
x[i];
}
The volume V = 3n + 2,
Where:
p = 0;
s = 0;
are each
counted once, and the instructions:
for(i = 0; i < n; i++)
s+ = x[i];
p + = x[i]*
x[i];
are each
run n times within the repetitive structure.
The instruction for is
considered unitary, though it contains more than one operation. When one uses
the instruction if, the following probabilities can be defined: p (if
the evaluated expression is true) and q- if the evaluated expression is not
true (p+q=1).
In
the sequence:
s=0;
p=0;
q=0;
if (a > b)
{
s+ = x[i];
p+
= x[i]*x[i];
}
else
{s+
= y[i]; p+ = x[i]* y[i]; q> = y[i] * y[i]; }
The
volume is 4 + p*2 + q*3,
The following instructions:
s = 0;
p = 0
q = 0;
if(a > b)
are
counted once. On the true branch of the alternative structure are two
instructions that are applied the ratio p, and on the third branch of the
structure there are no instructions to be applied the ratio q.
These
rules are applied to the aggregated sequences, from simple to complex items. In
the sequence:
min
= x [0];
minpos = 0;
for (i = 1; i < n; i++)
if (min > x[i])
{
min = x[i];
minpos = i;
}
The
processing volume will be:
V = 2 + (n-1) [1+1+2p].
The instructions min = x[0] and
minpos =0 are each run once. The instructions for and if are both
run (n-1) times. The instructions min =x[i] and minpos = i are each run (n-1) p
times.
This volume is, by far not an
exact evaluation of the number of instructions run within the programme. The
evaluation of the volume as a symbolic expression is used in order to evaluate
the performances of some programmes.
For example, the following
sequences:
|
s = 0; p =0; for (i
= 0; i< n; i++) s+ =
x[i]; for (i
= 0; i< n; i++) p+ = y[i]; |
|
s = 0; p =0; for (i
= 0; i< n; i++) { s+ =
x[i]; p+ =
y[i]; } |
|
Sequence
a) |
|
Sequence
b) |
Lead to
the respective volumes:
va = 2 + 2n + 2n = 2 + 4n
vb = 2 + 3n
By comparing vb - va = n > 0,
it results that sequence a) is better than sequence b) from what the executable
instructions are concerned.
The
executable instructions volume is also adjusted by taking into account
different p and q probabilities, as they result from experiments.