Network Analysis: Finding the Regional Influences of Codification

In this series of posts I have been greatly aided by Lincoln Mullen, assistant professor at George Mason University and a frequent collaborator with the Roy Rosenzweig Center for History and New Media there. You can find additional information on this post, including all the coding where we “show our work,” at its associated RPub.

Since my last post, I’ve pulled together about thirty more codes, OCR’d them, and cleaned up the resulting text for comparison. This larger sample size allows for more sophisticated comparison than the one-off density plots, but the basic computation remains the same. With the density plot experiment, we asked “how much of the text in X code matches the text in Y code, and where are those matches distributed in X code?”
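As a rough illustration of that basic computation, here is a minimal Python sketch. It is not the code from the RPub: the function names, the whitespace tokenization, and the n-gram length are all assumptions made for this example, and the two sample sentences are invented rather than quoted from any code.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def share_matched(x_text, y_text, n=5):
    """Percentage of the distinct n-grams in X that also occur in Y."""
    x_grams = ngrams(x_text, n)
    if not x_grams:
        return 0.0
    return 100.0 * len(x_grams & ngrams(y_text, n)) / len(x_grams)


def match_positions(x_text, y_text, n=5):
    """Word offsets in X at which an n-gram shared with Y begins."""
    words = x_text.lower().split()
    y_grams = ngrams(y_text, n)
    return [i for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) in y_grams]


# Invented sample sentences, for illustration only.
x = "every action must be prosecuted in the name of the real party in interest"
y = "an action must be prosecuted in the name of the state"

print(round(share_matched(x, y, n=3), 1))  # 58.3
print(match_positions(x, y, n=3))          # [1, 2, 3, 4, 5, 6, 7]
```

The first function answers "how much of X matches Y," and the list of offsets is the raw material for a density plot of where those matches fall within X.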

Now instead of comparing one jurisdiction with another single jurisdiction, we can compare all codes to each other at once and chart the percentage of matching n-grams between each pair of codes:

##                AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## AZ1865          100.0   30.4   75.9    4.4   16.9    1.5    6.7    2.4
## CA1850           13.2  100.0   13.9    3.8   16.4    0.9    5.0    1.7
## CA1851           76.8   32.6  100.0    4.4   17.9    1.5    7.2    2.5
## CT1879            0.3    0.6    0.3  100.0    0.4    0.1    0.3    0.1
## FL1870           18.0   40.1   18.8    6.6  100.0    1.4    7.0    5.2
## GA1860            2.2    3.1    2.2    2.9    2.0  100.0    2.0    1.7
## KY1851            6.1   10.5    6.5    3.7    6.0    1.2  100.0    1.6
## MI1853            1.3    2.1    1.3    0.6    2.6    0.6    0.9  100.0
## MO1849            5.4   13.2    5.6    1.7    6.8    0.6    4.4    1.2
## MO1856            4.4   10.8    4.5    5.3    5.9    0.9    4.3    1.9
## NC1868           17.6   40.3   18.4    6.7   44.9    1.7    7.3    2.6
## NV1861           69.0   31.1   69.5    4.5   17.6    1.8    7.1    2.5
## NV1869           64.5   28.9   65.6    4.3   16.9    1.6    6.9    2.4
## NY1848           10.9   28.2   11.5    3.0   19.6    0.7    4.7    1.6
## NY1849           16.2   42.3   17.4    5.1   35.3    1.1    6.3    2.2
## NY1850           37.3   33.3   38.7    6.1   28.0    2.2   11.5    3.2...

In the above chart, we can ask for each column, “what percentage of n-grams in this code are accounted for in other codes?” with the answer listed by row. But since we are tracking influence over time, we can throw out many of these numbers as meaningless, since a later code can be influenced only by an earlier code (and, of course, we don’t care that any given code is 100% consistent with itself):

##                AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## AZ1865            N/A    N/A    N/A    4.4   16.9    N/A    N/A    N/A
## CA1850           13.2    N/A   13.9    3.8   16.4    0.9    5.0    1.7
## CA1851           76.8    N/A    N/A    4.4   17.9    1.5    7.2    2.5
## CT1879            N/A    N/A    N/A    N/A    N/A    N/A    N/A    N/A
## FL1870            N/A    N/A    N/A    6.6    N/A    N/A    N/A    N/A
## GA1860            2.2    N/A    N/A    2.9    2.0    N/A    N/A    N/A
## KY1851            6.1    N/A    6.5    3.7    6.0    1.2    N/A    1.6
## MI1853            1.3    N/A    N/A    0.6    2.6    0.6    N/A    N/A
## MO1849            5.4   13.2    5.6    1.7    6.8    0.6    4.4    1.2
## MO1856            4.4    N/A    N/A    5.3    5.9    0.9    N/A    N/A
## NC1868            N/A    N/A    N/A    6.7   44.9    N/A    N/A    N/A
## NV1861           69.0    N/A    N/A    4.5   17.6    N/A    N/A    N/A
## NV1869            N/A    N/A    N/A    4.3   16.9    N/A    N/A    N/A
## NY1848           10.9   28.2   11.5    3.0   19.6    0.7    4.7    1.6
## NY1849           16.2   42.3   17.4    5.1   35.3    1.1    6.3    2.2
## NY1850           37.3   33.3   38.7    6.1   28.0    2.2   11.5    3.2...
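The date filter can be sketched in a few lines of Python. This is an assumption-laden illustration, not the author's method: it reads the year from the last four characters of each label and treats same-year pairs as not comparable. The table above evidently relies on finer-grained dates (it keeps NY1850 as a source for CA1850), which this sketch would drop. The function names and the three-code excerpt are chosen for the example.

```python
def year_of(label):
    """Parse the trailing four-digit year from a label such as 'CA1851'."""
    return int(label[-4:])


def mask_later_sources(matrix):
    """Keep cell (row, col) only when the row code predates the column code.

    matrix maps row label -> {column label -> percentage}. Dropped cells
    become None, which plays the role of N/A in the table above.
    Assumption: codes sharing a year are treated as not comparable.
    """
    return {
        row: {
            col: (pct if year_of(row) < year_of(col) else None)
            for col, pct in cols.items()
        }
        for row, cols in matrix.items()
    }


# A three-code excerpt of the unfiltered percentages shown earlier.
sample = {
    "AZ1865": {"AZ1865": 100.0, "CA1851": 75.9, "FL1870": 16.9},
    "CA1851": {"AZ1865": 76.8, "CA1851": 100.0, "FL1870": 17.9},
    "FL1870": {"AZ1865": 18.0, "CA1851": 18.8, "FL1870": 100.0},
}

masked = mask_later_sources(sample)
for row, cols in masked.items():
    print(row, cols)
```

For these three codes the result agrees with the filtered table: CA1851 remains a possible source for AZ1865 and FL1870, AZ1865 only for FL1870, and FL1870 for neither.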

Because the quality of OCR can differ from text to text, we have to be careful about what we can legitimately claim about these comparisons. CA1851 accounts for 77% of AZ1865, while NC1868 accounts for 45% of FL1870, but those percentages include printing blotches, OCR corruptions, typos, etc. So it’s not the case that the actual Arizona code was 77% derived from California’s. (Actually, the “real” percentage is probably quite a bit higher.) Nor is it necessarily the case that Arizona’s code matches California’s better than Florida’s matches North Carolina’s, since the Florida text may simply be a mess.

But even if the comparison matrix is imprecise, it’s not worthless. With each code we can see relatively significant leaps in correlation. Several codes track relatively well with the New York drafts, but have much stronger correlations with a regional neighbor, while some states like Connecticut seem to be doing their own thing. Where the percentages are close (say, between California and Nevada in Arizona’s column), more work may need to be done to establish which was the more influential code. But even from this partial chart we can see that statutes tend to correlate much more highly among regional neighbors than with the early New York code that kicked off the procedural revolution.

Instead of one-to-one density plots comparing distribution of matching n-grams throughout a code, we can now visualize the connections between multiple codes to get a sense of regional text families. If we set the minimum threshold at 10% correlation, we can cut out the noise of incidental matches and see where the text of earlier codes links up with later ones:

[Figure: network graph of all codes, with links drawn where at least 10% of n-grams match]

Among all these statutes, we can now clearly see where the Field Code family is. A number of states on the East Coast and in the South enacted codes, but apparently these projects were not influenced by the text of the Field Code, even if they adopted Field reforms, nor were any of these non-Field Code projects related to one another. Massachusetts did its own thing, as did Alabama, etc. Significantly, Louisiana, an early code which closely followed the civil law tradition, is sitting out there by itself, not influencing any later code. Another interesting connection on the periphery is Iowa to Utah–perhaps Mormon lawyers had an affinity for the laws of their pre-migration neighbor.
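The thresholding behind a graph like this can be sketched as follows. This is a hedged illustration rather than the code used to draw the figure: the function names are invented for the example, and the edge list is a small excerpt of the filtered percentages from the table, with each link running from an earlier code to a later one.

```python
def edges_above(pairs, threshold):
    """Keep (source, target, pct) triples whose percentage meets the threshold."""
    return [(s, t, p) for s, t, p in pairs if p >= threshold]


def isolated(nodes, edges):
    """Codes left with no link to any other code after thresholding."""
    connected = {s for s, _, _ in edges} | {t for _, t, _ in edges}
    return sorted(set(nodes) - connected)


# (earlier code, later code, % of the later code's n-grams found in the earlier one)
pairs = [
    ("CA1851", "AZ1865", 76.8),
    ("NV1861", "AZ1865", 69.0),
    ("NY1850", "AZ1865", 37.3),
    ("GA1860", "AZ1865", 2.2),
    ("NC1868", "FL1870", 44.9),
    ("NY1850", "FL1870", 28.0),
    ("CA1850", "FL1870", 16.4),
    ("KY1851", "FL1870", 6.0),
    ("GA1860", "FL1870", 2.0),
]
nodes = {s for s, _, _ in pairs} | {t for _, t, _ in pairs}

kept = edges_above(pairs, threshold=10.0)
print(len(kept))              # 6
print(isolated(nodes, kept))  # ['GA1860', 'KY1851']
```

In this excerpt Georgia and Kentucky drop out at the 10% cutoff, which is how codes end up unconnected on the periphery; passing a higher threshold prunes the graph further.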

We can now raise the correlation threshold to 25% and lop off the periphery, to see what we can discern among that jumble that is the Field Code family:

[Figure: network graph of selected codes, with links drawn where at least 25% of n-grams match]

As we would expect, New York 1850, the final polished draft of the original code, is at the center of the universe of influence. But there are clearly text families. The South is closely clustered together, as is the West, with no lines running between them except through New York. Among the western states, only Oregon seems to have drawn directly on New York rather than on the regional center of gravity, California.

Though I still need to add a couple of Midwestern states to my sample, the ones that are already there drop out when the threshold is raised this high. So that poses an analytic question for further research: Most of the Midwest was influenced by the Field reforms, but few states decided to adopt large portions of the Field text. What did they do instead?