Navraj Narula

The dataset that I am using comes from Sanford Weisberg’s book on applied linear regression. It has been neatly extracted into a workable format by Germán Rodríguez, a faculty member at Princeton University. You can access it here.

The data, labeled by Rodríguez as "Discrimination in Salaries," contains 52 instances and six attributes. Each instance represents attributes associated with a tenure-track professor in a small college. The attributes include:

* sx – the sex of the professor (male or female) [str]
* rk – rank of the professor (assistant, associate, or full) [str]
* yr – number of years in current rank [int]
* dg – highest degree earned (doctorate or masters) [str]
* yd – number of years since highest degree earned [int]
* sl – academic year salary in dollars [int]
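As a rough sketch of how one record could be represented in code, the attributes above map onto a small typed structure. The `Professor` class, the `parse_row` helper, and the whitespace-separated column order are assumptions made for illustration, and the sample row is invented rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Professor:
    sx: str  # sex: "male" or "female"
    rk: str  # rank: "assistant", "associate", or "full"
    yr: int  # number of years in current rank
    dg: str  # highest degree: "doctorate" or "masters"
    yd: int  # number of years since highest degree earned
    sl: int  # academic year salary in dollars


def parse_row(line: str) -> Professor:
    """Parse one row, assuming whitespace-separated fields in the order sx rk yr dg yd sl."""
    sx, rk, yr, dg, yd, sl = line.split()
    return Professor(sx, rk, int(yr), dg, int(yd), int(sl))


# Invented row for illustration only; not a real instance from the data.
example = parse_row("female associate 4 doctorate 9 25000")
print(example)
```

Keeping the three numeric attributes as integers makes it straightforward to sum or average salaries later on.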

I represented all instances of my dataset in an alluvial diagram, as depicted above. An alluvial diagram allows for depiction of information flow and the addition of facets of the data in the form of steps. I used RAWGraphs to help create mine.

I’ve chosen to visualize salary as the flow running through the diagram and added dg, sx, and rk as steps. Within each step, the attribute classes are ordered by the size of their flow. In this case, the size represents the total salary earned by the professors in that class.

The visualization is pleasing to look at and communicates a lot of information, even to someone who only has a second to glance at it. However, it may not faithfully represent the information in the dataset itself.

We might say that the majority of people with doctorate degrees are male full professors, and that they receive the largest share of salary. However, this may be only half true for female professors with the same education level and rank. This pattern (along with many others) appears when examining the flows in the diagram.

Each flow is accurate in terms of size, but the values in the flow itself may be misconstrued. Try hovering over the visualization. You'll see numbers appear throughout each flow chunk. You may believe that these numbers represent the salary that each person within each flow receives. However, this is not true. The values represent the sum of salaries for that particular group within the flow. The diagram does not show how much salary each person in a group makes, but rather how much salary the professors in that group make in total.
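To make the difference between a group's sum and its per-person salary concrete, here is a minimal sketch that groups records by (dg, sx, rk) and reports the sum, the head count, and the mean side by side. The records are invented for illustration and are not taken from the dataset; only the sum corresponds to what the diagram's flow widths encode.

```python
from collections import defaultdict
from statistics import mean

# Invented (dg, sx, rk, sl) records for illustration only.
records = [
    ("doctorate", "male", "full", 30000),
    ("doctorate", "male", "full", 34000),
    ("doctorate", "male", "full", 32000),
    ("doctorate", "female", "full", 35000),
]

# Collect the salaries that fall into each (dg, sx, rk) group.
groups = defaultdict(list)
for dg, sx, rk, sl in records:
    groups[(dg, sx, rk)].append(sl)

# The flow width corresponds to "sum"; "count" and "mean" are not shown in the diagram.
summary = {
    key: {"sum": sum(salaries), "count": len(salaries), "mean": mean(salaries)}
    for key, salaries in groups.items()
}

for key, stats in summary.items():
    print(key, stats)
```

In this made-up example the male group's flow would be almost three times as thick (96,000 versus 35,000), even though the one female professor earns more than the male average of 32,000.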

For instance, a large chunk may represent only two professors, one of whom earns 90% more than the other. From the diagram alone, however, there is no way to know how many professors are represented per chunk or how much money they make individually or on average.
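A small worked example, using invented numbers, shows why a chunk's total says nothing about who is inside it. The three hypothetical groups below would all be drawn with the same width.

```python
# Three invented groups that all sum to 58,000 dollars.
chunks = {
    "two equal salaries": [29000, 29000],
    "one earns 90% more": [20000, 38000],  # 38,000 is 90% more than 20,000
    "four lower salaries": [14500, 14500, 14500, 14500],
}

# The total is the only quantity the flow width would encode.
totals = {name: sum(salaries) for name, salaries in chunks.items()}
counts = {name: len(salaries) for name, salaries in chunks.items()}

for name in chunks:
    print(f"{name}: total={totals[name]}, professors={counts[name]}")
```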

It may also be the case that a female professor with a doctorate and the rank of full professor earns a higher salary than a male professor with the same degree and rank, but we are unable to see that in this diagram since each flow represents a sum and not individual data points.

Alluvial diagrams tend to aggregate information rather than depict the specifics of each individual group. How much detail is lost depends, of course, on the number of steps added to the diagram. In this case, the sum of salaries is not an ideal way to depict salary discrimination.