Real datasets often describe each object with attributes of different types, such as nominal, ordinal, and numeric. A dissimilarity measure for mixed attribute types quantifies how different two objects are across all of these attributes at once. The resulting dissimilarity matrix, which holds the dissimilarity for every pair of objects, is the input to tasks such as clustering and classification, and it helps reveal patterns and structure in the data.
Suppose we have a table of five objects. Each object has a product name (nominal), one of three priorities (ordinal), and a numeric value. The priorities are ranked as follows: Urgent (assigned the ordinal value of 3), High Priority (assigned the ordinal value of 2), and Low Priority (assigned the ordinal value of 1). The table is as follows:
| Object Identifier | Test I (Nominal) | Test II (Ordinal) | Test III (Numeric) |
| --- | --- | --- | --- |
| 1 | Product A | Low Priority | 45 |
| 2 | Product B | Urgent | 93 |
| 3 | Product B | High Priority | 65 |
| 4 | Product C | High Priority | 74 |
| 5 | Product A | Low Priority | 23 |
The steps to find the proximity measure for mixed attributes are as follows:
For nominal attributes:
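In the example above, we have five objects and one nominal attribute (Test I). The formula to calculate the dissimilarity between objects $i$ and $j$ for nominal attributes is as follows:
$$d(i, j) = \frac{p - m}{p}$$
where,
$p$ is the total number of nominal attributes, and $m$ is the number of those attributes on which objects $i$ and $j$ match.
Let’s find the dissimilarity between two objects. For Objects 2 and 3, both values are Product B, so $p = 1$ and $m = 1$:
$$d(3, 2) = \frac{1 - 1}{1} = 0$$
The dissimilarity matrix for nominal attributes is as follows:
$$\begin{bmatrix} 0 & & & & \\ 1 & 0 & & & \\ 1 & 0 & 0 & & \\ 1 & 1 & 1 & 0 & \\ 0 & 1 & 1 & 1 & 0 \end{bmatrix}$$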
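As a rough sketch of the nominal step, the Python below applies the match-based ratio (total attributes minus matches, divided by total attributes) to the Test I column. The function names `nominal_dissimilarity` and `dissimilarity_matrix` are illustrative choices, not part of any library.

```python
# Nominal attribute (Test I) for Objects 1..5, taken from the example table.
products = ["Product A", "Product B", "Product B", "Product C", "Product A"]


def nominal_dissimilarity(a, b):
    """Return (p - m) / p for a single nominal attribute, so p = 1."""
    p = 1                    # number of nominal attributes being compared
    m = 1 if a == b else 0   # number of attributes on which the objects match
    return (p - m) / p


def dissimilarity_matrix(values, dist):
    """Build the full symmetric matrix of pairwise dissimilarities."""
    return [[dist(a, b) for b in values] for a in values]


nominal_matrix = dissimilarity_matrix(products, nominal_dissimilarity)

# Print the lower triangle, one row per object.
for i, row in enumerate(nominal_matrix):
    print(" ".join(f"{d:.0f}" for d in row[: i + 1]))
```

With only one nominal attribute, every entry is either 0 (same product) or 1 (different product).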
For ordinal attributes:
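We have already assigned a rank to each priority. Next, normalize these ranks so that they fall in the range 0.0 to 1.0. We can map each rank with the help of the following formula:
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
where,
$r_{if}$ is the rank of object $i$ for ordinal attribute $f$, and $M_f$ is the number of ordered states of attribute $f$ (here, $M_f = 3$).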
Let’s see what the normalized table for this example looks like:
| Object Identifier | Test II Priority Rank | Test II Normalized Value |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 3 | 1 |
| 3 | 2 | 0.5 |
| 4 | 2 | 0.5 |
| 5 | 1 | 0 |
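With the normalized ranks, let’s calculate the dissimilarity between pairs of data points using the Euclidean distance formula. The Euclidean distance between two points $x$ and $y$ that are each described by a single value is:
$$d(x, y) = \sqrt{(x - y)^2} = |x - y|$$
In our case:
Distance between Object 1 and 2 is $\sqrt{(0 - 1)^2} = 1$
The dissimilarity matrix for ordinal attributes is as follows:
$$\begin{bmatrix} 0 & & & & \\ 1 & 0 & & & \\ 0.5 & 0.5 & 0 & & \\ 0.5 & 0.5 & 0 & 0 & \\ 0 & 1 & 0.5 & 0.5 & 0 \end{bmatrix}$$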
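The ordinal step can be sketched in a few lines of Python: map each priority to its rank, rescale the rank to the range 0 to 1, and take the absolute difference of the rescaled values. The names `RANK` and `normalize_rank` are hypothetical helpers introduced for this illustration.

```python
# Ordinal attribute (Test II) for Objects 1..5, taken from the example table.
priorities = ["Low Priority", "Urgent", "High Priority", "High Priority", "Low Priority"]

# Rank of each ordered state, as assigned in the example.
RANK = {"Low Priority": 1, "High Priority": 2, "Urgent": 3}


def normalize_rank(rank, num_states):
    """Map a rank in 1..num_states onto [0, 1] via (rank - 1) / (num_states - 1)."""
    return (rank - 1) / (num_states - 1)


z = [normalize_rank(RANK[p], len(RANK)) for p in priorities]

# For a single attribute, the Euclidean distance reduces to |z_i - z_j|.
ordinal_matrix = [[abs(zi - zj) for zj in z] for zi in z]

print(z)  # [0.0, 1.0, 0.5, 0.5, 0.0]
```

Objects 3 and 4 share the same priority, so their ordinal dissimilarity is 0, while Objects 1 and 2 sit at opposite ends of the scale and get 1.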
For numeric attributes:
Numeric attributes are variables with numerical values. Because the combined measure averages the per-attribute dissimilarities, each numeric dissimilarity must first be scaled to the range 0 to 1. Otherwise, an attribute with a large range would dominate the nominal and ordinal terms. To find the dissimilarity, we use the Manhattan distance, which is commonly used for numeric attributes, and divide it by the range of the attribute.
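The formula for the Manhattan distance is as follows:
$$d(i, j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$$
Here’s what the formula represents:
$x_{if}$ and $x_{jf}$ are the values of numeric attribute $f$ for objects $i$ and $j$, and $p$ is the number of numeric attributes. With a single numeric attribute, the sum has only one term, and dividing it by the range of the attribute scales the result to between 0 and 1:
$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$$
In our case, the maximum value is 93 and the minimum value is 23, so the range is $93 - 23 = 70$.
According to the normalized Manhattan distance formula, the distance between objects is as follows:
Distance between Object 2 and 1: $\frac{|93 - 45|}{70} = \frac{48}{70} \approx 0.69$
Distance between Object 3 and 1: $\frac{|65 - 45|}{70} = \frac{20}{70} \approx 0.29$
Distance between Object 3 and 2: $\frac{|65 - 93|}{70} = \frac{28}{70} = 0.40$
The matrix is as follows:
$$\begin{bmatrix} 0 & & & & \\ 0.69 & 0 & & & \\ 0.29 & 0.40 & 0 & & \\ 0.41 & 0.27 & 0.13 & 0 & \\ 0.31 & 1.00 & 0.60 & 0.73 & 0 \end{bmatrix}$$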
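For the numeric step, a minimal sketch in Python divides the absolute difference of two values by the range of the attribute (maximum minus minimum). The function name `range_normalized_distance` is an illustrative choice, and the sketch assumes the attribute is not constant, so the range is nonzero.

```python
# Numeric attribute (Test III) for Objects 1..5, taken from the example table.
values = [45, 93, 65, 74, 23]


def range_normalized_distance(x, y, lo, hi):
    """Return |x - y| / (hi - lo); assumes hi > lo (a non-constant attribute)."""
    return abs(x - y) / (hi - lo)


lo, hi = min(values), max(values)  # 23 and 93, so the range is 70

numeric_matrix = [
    [range_normalized_distance(x, y, lo, hi) for y in values] for x in values
]

# Print the lower triangle, rounded to two decimals.
for i, row in enumerate(numeric_matrix):
    print(" ".join(f"{d:.2f}" for d in row[: i + 1]))
```

Objects 2 and 5 hold the maximum and minimum values, so their normalized distance is exactly 1.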
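Now, combine the different attributes into a single dissimilarity matrix. The dissimilarity $d(i, j)$ between objects $i$ and $j$ is defined as:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where $d_{ij}^{(f)}$ is the dissimilarity between objects $i$ and $j$ on attribute $f$, $p$ is the total number of attributes, and the indicator $\delta_{ij}^{(f)} = 0$ if either of the following holds:
$x_{if}$ or $x_{jf}$ is missing, where $x_{if}$ denotes the value of attribute $f$ for object $i$ and $x_{jf}$ denotes the value of attribute $f$ for object $j$.
$x_{if} = x_{jf} = 0$, and attribute $f$ is an asymmetric binary attribute.
Otherwise, $\delta_{ij}^{(f)} = 1$.
As we can see, there is no missing value and also no asymmetric binary attribute, so $\delta_{ij}^{(f)} = 1$ for every attribute and every pair of objects, and the denominator is 3.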
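To illustrate how the indicator works, the sketch below averages only the per-attribute dissimilarities that can actually be compared. Representing an attribute whose indicator is 0 as `None` is an assumption made for this example, and `combine` is a hypothetical helper name.

```python
def combine(per_attribute):
    """Average the per-attribute dissimilarities whose indicator is 1.

    `per_attribute` holds one entry per attribute: a dissimilarity in [0, 1],
    or None when the indicator is 0 (for example, a missing value).
    """
    included = [d for d in per_attribute if d is not None]
    if not included:
        raise ValueError("no attribute can be compared for this pair")
    # The numerator sums the included terms; the denominator counts them.
    return sum(included) / len(included)


# All three attributes present: the denominator is 3.
print(combine([1.0, 0.5, 0.4]))

# Ordinal value missing for one object: the denominator drops to 2.
print(combine([1.0, None, 0.4]))
```

Excluded attributes shrink the denominator as well as the numerator, so a missing value does not pull the result toward 0.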
Now, apply the formula. Each calculation multiplies the nominal, ordinal, and numeric dissimilarities by their indicators, adds the results, and divides by 3.
| Objects | Calculation |
| --- | --- |
| Object 2 and 1 | ((1×1) + (1×1) + (1×0.69)) / 3 = 0.90 |
| Object 3 and 1 | ((1×1) + (1×0.5) + (1×0.29)) / 3 = 0.60 |
| Object 3 and 2 | ((1×0) + (1×0.5) + (1×0.40)) / 3 = 0.30 |
| Object 4 and 1 | ((1×1) + (1×0.5) + (1×0.41)) / 3 = 0.64 |
| Object 4 and 2 | ((1×1) + (1×0.5) + (1×0.27)) / 3 = 0.59 |
| Object 4 and 3 | ((1×1) + (1×0) + (1×0.13)) / 3 = 0.38 |
| Object 5 and 1 | ((1×0) + (1×0) + (1×0.31)) / 3 = 0.10 |
| Object 5 and 2 | ((1×1) + (1×1) + (1×1)) / 3 = 1.00 |
| Object 5 and 3 | ((1×1) + (1×0.5) + (1×0.60)) / 3 = 0.70 |
| Object 5 and 4 | ((1×1) + (1×0.5) + (1×0.73)) / 3 = 0.74 |
The final dissimilarity matrix is as follows:
$$\begin{bmatrix} 0 & & & & \\ 0.90 & 0 & & & \\ 0.60 & 0.30 & 0 & & \\ 0.64 & 0.59 & 0.38 & 0 & \\ 0.10 & 1.00 & 0.70 & 0.74 & 0 \end{bmatrix}$$
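The whole worked example can be reproduced with a short Python sketch that starts from the raw table. It assumes every indicator is 1, as in this example, and the name `mixed_dissimilarity` is illustrative. Because it works from unrounded intermediate values, a result can differ in the last digit from a hand calculation that rounds each term first.

```python
from itertools import combinations

# (product, priority rank, numeric value) for Objects 1..5, from the example table.
objects = [
    ("Product A", 1, 45),
    ("Product B", 3, 93),
    ("Product B", 2, 65),
    ("Product C", 2, 74),
    ("Product A", 1, 23),
]

NUM_STATES = 3  # Low Priority, High Priority, Urgent
numeric = [obj[2] for obj in objects]
value_range = max(numeric) - min(numeric)  # 93 - 23 = 70


def mixed_dissimilarity(a, b):
    """Average the three per-attribute dissimilarities (every indicator is 1)."""
    d_nominal = 0.0 if a[0] == b[0] else 1.0
    # Equivalent to |z_a - z_b| with z = (rank - 1) / (NUM_STATES - 1).
    d_ordinal = abs(a[1] - b[1]) / (NUM_STATES - 1)
    d_numeric = abs(a[2] - b[2]) / value_range
    return (d_nominal + d_ordinal + d_numeric) / 3


# Key each pair as (larger id, smaller id), using 1-based object identifiers.
pairs = {
    (j + 1, i + 1): mixed_dissimilarity(objects[j], objects[i])
    for i, j in combinations(range(len(objects)), 2)
}

for (j, i), d in sorted(pairs.items()):
    print(f"d({j},{i}) = {d:.2f}")

closest = min(pairs, key=pairs.get)
farthest = max(pairs, key=pairs.get)
print("most similar pair:", closest, "most dissimilar pair:", farthest)
```

Sorting or taking the minimum and maximum of `pairs` is one simple way to read off the closest and farthest objects without scanning the matrix by eye.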
As a result, we can say that:
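Object 1 is highly similar to Object 5 with a dissimilarity score of 0.10, the lowest of any pair. They share the same product and priority and differ only in their numeric values.
Object 5 is highly dissimilar to Object 2 with a dissimilarity score of 1.00, the highest possible. They differ in product, sit at opposite ends of the priority scale, and hold the minimum and maximum numeric values.
Object 3 is moderately dissimilar to Object 1 with a dissimilarity score of 0.60.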
A dissimilarity measure for attributes of mixed types puts nominal, ordinal, and numeric differences on a common 0 to 1 scale and averages them, so objects described by heterogeneous attributes can be compared directly. The resulting matrix is valuable in tasks like clustering heterogeneous datasets, and it can also support feature selection and machine learning models that depend on distances between objects.