Chapter 3

 

SOURCE CODE METRICS

 

 

            3.1 The form of the programmes

        A programme gets many forms before being run.

                In the conceptual phase of the specifications, the programme takes the form of the variable description, of the processing sequences, results and operating mode.

                The programming techniques develop specific ways of algorithm and data structure description, aiming to find as rigorous and as unambiguous ways as possible.

                Specification making methods are presented in  [  ].

                If the ideal solution has been found, the Varnier techniques, nested logic, top-down would have had a constantly decreasing contribution, down to their elimination. One can define metrics on specifications. However, the variety of the description methods, their inaccuracy and diversity reduce the result comparison possibility.

                Working on the source code could be replaced with an analysis on the pseudo-code form.

                The pseudo-code is an intermediate algorithm description language. There are no standards for such a language, which leads to local presentations and the lack of comparability.

            The following issues justify working on the source code:

¨       There are languages for which standards are defined;

¨       The languages have compilers meeting these standards;

¨       The programming discipline do not allow misunderstandings;

¨       There is a very large number of users;

¨       The wish for portability of the programme systems dictates the use of the features that can be found in all language implementations.

Due to this situation, working on the source code is not altered by subjective fluctuations, and should the indicators be applied to programmes that are currently used, this means that all the terms set by a user are met.

                Language diversity and the large number of programmes categories require different approaches. In the first stage a programmes base in C++, another one in PASCAL, another one in COBOL, and another one in FORTRAN programming languages are made.

                Each programmes base is very homogeneous relatively to the programming language. The ability to compare programmes requires for the group elements to be homogeneous. The exogenous variables must find their corresponding levels in each element.      

                The programmes are transformed from one language into another by defining transforming coefficients. The quality of the respective proportions makes this transformation accurate.

                The analysis of the source code is a way of measuring. However, it is not the only one. It shows the advantage of the possibility to write software for analysis processes automating [Balog].

 

            3.2 The length of the programme measured in the number of instructions

Defining a programming style brings new rules regarding the use of instructions and the expressions making.

                The result is programmers being disciplined so that, if a problem is let known to a group of independently working programmers, the result will be a multitude of programmes which do not differ significantly from one another (from what the use, data types, control variables, executable instructions are concerned).

                The programming style emphasises the programmers’ experience. The best sequences are selected. A general scheme will be defined in order to create a file, and only the fields will differ from one application to another.

                Should the programming style be developed and assimilated, the number of source lines expresses the length of the programme.

                For example, the next rules are set for the programmes written in the C language:

·         The comments are not taken into account when calculating;

·         The accolades describing the blocks take a row each;

·         The assignment expressions take a line each;

·         The multiple assignment expressions take a line;

·         Each instruction defined with key words takes a line;

·         The function heading takes a line;

·         The control variables have a one-letter name;

·         The increasing and decreasing operations are done with unary operators;

·         Line up on levels the instructions belonging to the same block;

 

When applying these rules, the procedure computing the product of two matrixes is:             

                void matprod (int a[M][N], int b[N][K], int c[M][K], int m, int n, int k)

                                {

                                int i,j,p,cij;

                                for (i = 0; i < m, i++)

                                for (j = 0; j < k; j++)

                                                {

                                                cij = 0;

                                                for (p = 0; p < n; p++)

                                                                cij+= a[i,p] * b[p,j];

                                                c[i][j] = cij;

                                                }

                                }

 

                The function choosing the minimum element out of three elements will be:             

 

                float minim (float a, float b, float c)

                                {

                                float min;

                                min = a;

                                if(min > b) min = b;

                                if(min > c) min = c;

                                return (min);

                                }

The sequence inputting the students’ community describing data will be:

 

                struc student

                                {

                                char name[20];

                                int age;

                                float absavg;

                                }

                                stud [20];

                                int i,n;

                                printf (“/n The numbet of students in the class:”);

                                scanf (“%d”, n);

                                for (i = 0; i < n; i++)

                                {

                                printf (“/n The name of the student:”);

                                gets (stud.nume [i]);

                                printf (“/n Age:”);

                                scanf (“%d”,&stud.age [i]));

                                printf (“/n Graduation degree:”);

                                scanf (“%5.2f”,&stud.absavg [i]));

                                }

 

                One can notice the homogeneity of the sequences, due to which the programmers can correctly understand the programme, even at the first reading. It is these rules that make the average number of elements (operands and operators) on a line not to differ too much from one programme to another. Systematically defected programmes are not taken into account. We include here the programmes containing instructions repeating for no reason.

                For example:

 

                int a[3] [3];

                                a[0] [0] = 0;

                                a[0] [1] = 0;

                                .

                                .

                                .

                                a[2] [2] = 0, when the initialisation can be made as follows:

                                for (i = 0; i < 3; i++)

                                for (j = 0; j < 3; j++)

                                                a[i] [j] = 0;

 

                Should the initialisation values be random, one can either make the initialisation at definition time, or input the data from the keyboard.  The repeated assignment of values is not an option.

                So the length of the programme measured as the number of source code lines needs a minimum set of rules to be applied for efficiency purposes.

                Using the programme NRLIN, which starts by normalizing the source code, one can get the indicator of the number of source code lines.

                “To normalize” is to perform a process that applies transformations to a programme written in a certain language, in order to meet the rules dictated by the programming style.

                For example, the programme for choosing the minimum and maximum elements and their positions in an array is given as follows:

               

                maxpoz =minpos = 0;

                max = min = x[0];

                for (i = 1; i < n; i++)   {

                if (min > x[i] {   min = x[i]; minpos = i;}

                if (max < x[i] {   max = x[i]; maxpos = i;}

 

                The normalized programme is:

 

                max = min = x[0];

                maxpos = minpos = 0;

                for (i = 1; i < n; i++)

                {

                                if (min > x[i])

                                {

                                                min = x[i];

                                                minpos = i;

                                }

                                if (max < x[i])

                                {

                                                max = x[i];

                                                maxpos = i;

                                }

                }

                The normalisation rules are no longer local when the NRLIN programme is used before the C/C++ programme base is made.      

                Although the programmers aim to make compact programmes by writing many instructions on a line or increasing the complexity of expressions, normalizing is necessary for comparison purposes.

For example, the function choosing the minimum out of three integers is written in one of the following forms:

 

A)           int minim (int a, int b, int c)

                                {

                                int min;

                                min = a;

                                if (min > b) min = b;

                                if (min > c) min = c;

                                return (min);

                                }

 

B)            int minim (int a, int b, int c)

                                {

                                int min;

                                if (a > b)

                                                if (b > c) min = c;

                                                                else

                                                                                min = b;

                                                else

                                                                if (a > c) min = c;

                                                                                else

                                                                                min = a;

                                                                return (min);

                                }

 

C)            int minim (int a, int b, int c)

                                {

                   int a1;

                                 a1=((a) < (b)) ? (a): (b);

                                 return (((a1) < (c)) ? (a1) : (c))

                                }

                If the versions A, B, C of the programme are saved in files, the source code lines analysis leads to the following results:

 

Programme

The number of un-normalized source code lines

The number of normalized source code lines

A

6

8

B

8

13

C

4

6

Chart 3.1

 

                If the normalized version has the same number of lines like the un-normalized one, then the capacity level is at its maximum.

                The programmes kit takes into account observations regarding the sequences, without making a compromise between the number of instructions and the errors generation in the case of very compact programmes.

                The simple sequences, those with a larger number of source code lines are preferred to the compact, but with the possibility of error generation ones.

 

            3.3. The Length of a Programme Seen as a File

If the programming style is very strict (the case of a training centre of a certain software production unit), then the length of the programme seen as a file is a reasonable measuring unit. It requires:

¨       Defining the heading of a function like a comment, but always on 12 source code lines;

¨       Writing a three line comment every time an important step of the algorithm is taken;

¨       Defining standard sequences for usual processing;

¨       Using programme sequences libraries (with macro definitions);

¨       Dictating an inferior limit level of re-using.

If a new problem is given to a group of programmers, more than 80 per cent of the suggested solutions are identical. The teams of programmers are very homogeneous. The length of the file is expressed in bytes. Its homogeneity is expressed by:

·        The blanks percentage (all of them write the instructions on the line identically);

·        The intra-sequence and inter-sequence comments percentage;

·        Key-words percentage;

·        Operators and operands percentage;

These percentage structures do not differ significantly from one file to another, in the programmes base of a certain language.

                The functions:

 a) Scalar product of two arrays of vectors:

                int scalprod (int x[ ], int y[ ], int n)

                                {

                                int s, i;

                                /* the n elements of the array are multiplied */

                                for (s = 0, i = 0; i(n; i++)

                                                s+ = x[i] * y[i];

                                return (s);

                                }

 

b) Addition of two matrixes

 

                void matadd (int a[ ][N],int b[ ][N], int c[ ][N], int n, int m)

                                {

                                int j, i;

                                /* the m lines of the matrixes a and b are gone over */

                                for (i = 0; i<m; i++)

                                /* the elements on the line i of the matrixes a and b are added */

                                for (j = 0; j<n; j++)

                                c[i][j] = a[i][j] + b[i][j];

                                }

 

c) Choice of minimum out of a matrix

 

                int matmin (int a[ ][N], int n , int m)

                                {

                                int j, i, min;

                                min = a[0][0];

                                /* the m lines of the matrix a are gone over */

                                for (i = 0; i<m; i++)

                                /* the line i is searched for the minimum element */

                                for (j = 0; j<n; j++);

                                                if(min>a[i][j])min = a[i][j];

                                return (min);

                                }

 

d) Function for generation of the an term of the Fibonacci array (without recursion) is:                        ak = ak-1 + ak-2, k = 2,3...

                                a0 = 0; a1 =1

 

                void fibonacci (int n)

                                {

                                int i, ai, aimin1, aimin2;

                                aimin2 = 0;

                                aimin1 = 1;

                                /* Testing the value of n*/

                                if (n == 0 || n == 1)printf(“/n n must be > =2 “);

                                                else

                                /* Computing ak = ak -1 + ak -2;   k =2,3….*/

                                for (i = 2;i < =n; i++)

                                                {

                                                ai = aimin1 + 1imin2;

                                                printf(“/n a%d = %d”, i, ai);

                                                aimin1 = ai;

                          aimin2 = aimin1;

                                                }

                                }

                One can notice the strictness of applying the rules dictated by a programming style in making the four functions.  The results are the data in chart 3.2:

 

Function

(Programme)

The length of the file (bytes)

The number of source code lines

The number of bytes per instruction

The number of words per source code line

0

1

2

3

4

A

176

8

176/8

45/8

B

272

9

272/9

66/9

C

295

11

295/11

69/11

D

376

17

376/17

87/17

Chart 3.2

 

                By word we mean:

·        Key words;

·        Variable names;

·        Separators;

·        Operators.

                For example:

                For the function scalprod there are:

                                No. of key words….                                                              7

                                No. of operators.....                      .. ++ * (   ) ] + = <           21

                                No. of separators  },{, comma and ;  ................................. 11

                                Variable names.................................................................... 6

                                                                                                                       __________________

                                                                                                                        Total                45

               

One can notice that, for these functions the relative values from columns 3 and 4 of the chart 3.2 differ significantly, so the length of the file can be got by applying a proportion coefficient to the number of source code lines.

                The statistic analysis of the columns 3 and 4 can emphasise the stability / instability of the programming style, so it can also show how representative the indicator of the file length is.

                A function is attached to the programme NRLIN in order to get the length of the file and to obtain the chart 3.2 indicators.

 

            3.4 The Halstead Metric

                Any programming language is defined by:

¨        Un-running (declarative) instructions definition;

¨        Executable instructions;

¨        Handling the operators and operands within expressions;

¨        The programmes are made up of instructions, written in sequences, without taking into account the running order;

¨        The programmes are considered homogeneous, without taking into account the made processing.

Although data types require assigning different memory areas, all the operands can be considered homogeneous, and counted globally. Though operators involve different numbers of machine cycles, they will be considered homogeneous and counted globally.

                Although the key words corresponding to executable instructions have different length compiling sequences, they will be considered homogeneous, and counted globally.

                These simplifying hypotheses are specific to the Halstead metric.

                Let’s consider:

                                n1 – the number of distinct operands;

                                n2 – the number of distinct operators;

                The following relations are defined:

·       n  = n1 + n2, where n is the vocabulary used in the programme;

·       N = N1 +N2, where:

                                                N1 – the total number of operands in the programme;

N2 – the total number of apparitions of the operators in the       programme;

                                                N – the observed length of the programme.

·       C = n1 log2 n1 + n2 log2 n2, where C is the evaluated complexity of the programme;

·       V = N log2 n, where V is the volume of the programme;

·       K = , where K is the transparency of the programme;

·       L = , where L is the implementation level;

·       D = , where D is the difficulty.

                The programme PPP.CPP automates the calculation of these indicators for C++ programmes.

                One can notice that the Halstead metric considers a programme to be an ensemble of parts, without taking into account the interactions and the running mode of the instructions.

                For example, the sequences:

s = 0;

i  = 0;

x[ i ]  = 1;

 

 

for (  i = 1; i < n; i ++)

     {

       x [ i ] = ( i + 3 ) * ( i + 3 );

       s+ = x [ i ];

     }

 

s = 0;

for ( i = 0; i < n; i ++)

     {

y [ i ]  = 0;

z [ i ] = i;

       x [ i ] = ( i + 3 ) * ( i + 3 );

       s+ = x [ i ];

     }

Sequence a)

 

Sequence b)

 

Although different in meaning, if taken into account the running and the results, when they are applied to the programme PPP.CPP lead to the results in the chart 3.3.

 

Indicator

Sequence a)

Sequence b)

n1

 

 

n2

 

 

N1

 

 

N2

 

 

V

 

 

C

 

 

D

 

 

Chart 3.3

 

                Although the instructions are not commutative, the Halstead metric does not show commutativness, it is for this particular reason that it is applied to accurate and optimal programmes in order to made representative comparisons.

                For example, the sequences:

 

s = 0

a = 0;

for ( i = 0; i < n; i ++)

     {

       s+ = x [ i ];

     }

 

s = 0

for ( i = 0; i < n; i ++)

     {

       a = 0;

       s+ = x [ i ];

     }

Sequence a)

 

Sequence b)

 

Although identical from what the Halstead metric is concerned, the non-optimality of the b) sequences not emphasized.

 

            3.5 Diversity Indicators

                The Halstead metric identifies n1 – the number of distinct operands and n2 – the number of distinct operators. When presenting the languages, slight meaning differences emerge even within the group of operands. Considering this, detailing the groups is a must, which means that a transition from the two groups to a larger number of groups reflecting these slight differences takes place.

                Let’s consider:

                                fi – the number of i elements present in the programme;

                                m – the number of types (m = 2 corresponds to the Halstead metric);

                The relation:

                                S =

Measures up the diversity (Tovissi & Taylor).

                In [Ivan, Popescu, Nica] for the C programming language, m = 7, where:

                                f1 -

                                                f2 -

                                                f3 -

                                                f4 -                

                                                f5 -

                                                f6 -

                                                f7 -

                According to this let S be an m variable function.

                                x – array of m components ( x1, x2, ......., xm );

                                S [x] = ;

                                With xi >0 ; xi Î (0, K ]

                                Does function S [x] allow a maximum value for x1 = x2 = ... xm =?

In order to make sure that the programmes are compatible, normalization is a must. A method of normalization is:

                                s [x] =

 

                Another method is:

                                s [x] =

                For the sequences:

 

S = 0;

p =0;

for (i = 0; i< n; i++)

s+ = x[i];

for (i = 0; i< n; i++)

p+ = y[i];

 

s = 0;

p =0;

for (i = 0; i< n; i++)

{

s+ = x[i];

p+ = y[i];

}

Sequence a)

 

Sequence b)

 

The programmes PPP and PP1 implementing this definition will lead to the results given in chart 3.4.

 

The compared analysis of the source code metrics

Indicator

Halstead metric

Diversity metric

C

 

 

V

 

 

 

 

 

Chart 3.4

 

            3.6 Statistic analysis of the source code

                Variables are defined and numbers of using (in expressions, parameter lists) in the source code are counted.

                The degree of using is defined as a ratio between the number of apparitions of a variable and the total number of apparitions of all the variables. This way the definitions of the intense used or not used variables are emphasised. a, which is a relative measure of the deviation from the average will show even programming lacks that have not been brought forward by the compilers.

                a =, Where:

                K – The number of variables defined in the programme;

                - The average value of the number of the variables use in the programme;

                - The frequency of the variable i in the programme.

a2 + a3 ; c > 0

a2 + a4 ; c = 0

a3 + a4 ; c < 0

 
               

For example, the functions for the evaluation of the expression

                                              

            f =                        are:

a)             int calcul1 (int a, int c)

                                {

                                 int f;

                                 if (c > 0)

                                                f= a*a + a*a*a;

                                if (c = 0)

                                                f= a*a + a*a*a*a;

                                if (c < 0)

                                                f = a*a*a +a *a*a*a;

                                return (f);

                }

 

b)            int calcul2 (int a, int c)

                                {

                                int f, b, d, e;

                                b = a*a;

                                d = b*a;

                                e = d*a;

                                if ( c > 0)

                                f = b + d;

                                if (c = 0)

                                f = b+e;

                                if (c < 0)

                                f = e + d;

                                 return (f);

                                }

c)             int calcul3 (int a, int c)

                                {

                                int f, b;

                                b =a*a;

                                if ( c > 0)

                                f = b + b*a;

                                else

                                if (c = 0)

                                f = b+b*b;

                                 else

                                f = a*b+b*b;

                                 return (f);

                                }

 

The chart 3.5 of the variable utilization frequencies is made. These frequencies regard only the executable instructions in a function.

 

Indicators

Appearance frequencies

 

Calculation 1

Calculation 2

Calculation 3

c

3

3

3

a

18

4

4

f

4

4

4

b

 

4

9

d

 

4

 

e

 

4

 

The total of appearances

 

3

 

Average appearances

25

22

20

a

25/3

22/6

20/4

Chart 3.5

 

                The result is an operands use intensity difference. However, version calcul3 ensures equilibrium between the number of operands and the intensity of their use.

                All the types of indicators of the applied statistic can be defined on to the programmes regarded as groups of instructions (each having is position in the programme).

 

            3.7. The Volume of the Executable Instructions

                Like the case of the Hallstead metric, the hypothesis that the instructions in the programme are homogeneous is made, although there are big differences in meaning between the instruction a = 0 and the instruction if(a > b) s+= x[i].

                As we consider the executable instructions to be identical, the operations volume first comes up as expressed by the number of run (executed) instructions.

                For the following sequence:

 

                                a = 0;

                                b = 2;

                                C = m*n +x [i];

The volume V = 3, as the sequence contains three instructions, each of them run once.

                For the sequence:

                                p = 0;

                                s = 0;

                                for (i = 0; i < n; i++ )

                                {

                                s+ = x[i];

                                p + = x[i]* x[i];

                                }

                The volume V = 3n + 2,

Where:

                p = 0;

                s = 0;

are each counted once, and the instructions:

                for(i = 0; i < n; i++)

                s+ = x[i];

                p + = x[i]* x[i];

are each run n times within the repetitive structure.

                The instruction for is considered unitary, though it contains more than one operation. When one uses the instruction if, the following probabilities can be defined: p (if the evaluated expression is true) and q- if the evaluated expression is not true (p+q=1).

 

In the sequence:

                s=0;

                p=0;

                q=0;

                if (a > b)

                {

                s+ = x[i];

                p+ = x[i]*x[i];

                }

                else

                {s+ = y[i]; p+ = x[i]* y[i]; q> = y[i] * y[i]; }

The volume is 4 + p*2 + q*3,

                The following instructions:

                                s = 0;

                                p = 0

                                q = 0;

                                if(a > b)

are counted once. On the true branch of the alternative structure are two instructions that are applied the ratio p, and on the third branch of the structure there are no instructions to be applied the ratio q.

These rules are applied to the aggregated sequences, from simple to complex items. In the sequence:

min = x [0];

                minpos = 0;

                for (i = 1; i < n; i++)

                if (min > x[i])

                                {

                                min = x[i];

                                minpos = i;

                                }

The processing volume will be:

                V = 2 + (n-1) [1+1+2p].

                The instructions min = x[0] and minpos =0 are each run once. The instructions for and if are both run (n-1) times. The instructions min =x[i] and minpos = i are each run (n-1) p times.

                This volume is, by far not an exact evaluation of the number of instructions run within the programme. The evaluation of the volume as a symbolic expression is used in order to evaluate the performances of some programmes.

                For example, the following sequences:

 

s = 0;

p =0;

for (i = 0; i< n; i++)

s+ = x[i];

for (i = 0; i< n; i++)

p+ = y[i];

 

s = 0;

p =0;

for (i = 0; i< n; i++)

{

s+ = x[i];

p+ = y[i];

}

Sequence a)

 

Sequence b)

Lead to the respective volumes:

                va = 2 + 2n + 2n = 2 + 4n

                vb = 2 + 3n

                By comparing vb - va = n > 0, it results that sequence a) is better than sequence b) from what the executable instructions are concerned.

            The executable instructions volume is also adjusted by taking into account different p and q probabilities, as they result from experiments.