Data Acquisition and Corpus Creation
We began by creating a data set of all Virginia laws passed during the long Jim Crow era, from 1865 to 1968. The laws for each legislative session in this period were printed in the Acts and Joint Resolutions Passed by the General Assembly of the State of Virginia (Acts). All but six of the Acts volumes were openly accessible for educational, non-commercial use on HathiTrust; many of these had originally been scanned from collections at UVA (examples here and here). The UVA Law Library scanned the remaining six volumes to complete the data set.
Using Python, we split the files for each Acts volume, along with all accompanying text files, by page and saved each page under a filename denoting its year, volume, and page number. We then manually reviewed the pages and removed extraneous ones, such as title pages and indices.
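The page-naming and filtering step described above can be illustrated with a minimal sketch. The exact filename pattern used in the project is not given here, so the `<year>_v<volume>_p<page>` scheme, the `PageId` type, and the helper names `page_filename`, `parse_page_filename`, and `keep_pages` are all hypothetical, chosen only to show how year, volume, and page number might be encoded in a filename and how pages flagged during manual review might be dropped.

```python
import re
from pathlib import Path
from typing import Iterable, List, NamedTuple, Optional, Set


class PageId(NamedTuple):
    """Identifies one page of one Acts volume (hypothetical structure)."""
    year: int
    volume: int
    page: int


# Assumed naming scheme, e.g. "1901_v01_p0042"; the real project may differ.
_PATTERN = re.compile(r"^(?P<year>\d{4})_v(?P<volume>\d{2})_p(?P<page>\d{4})$")


def page_filename(year: int, volume: int, page: int, ext: str) -> str:
    """Build a zero-padded filename so that pages sort in reading order."""
    return f"{year:04d}_v{volume:02d}_p{page:04d}.{ext.lstrip('.')}"


def parse_page_filename(name: str) -> Optional[PageId]:
    """Recover year, volume, and page from a filename; None if it does not match."""
    match = _PATTERN.match(Path(name).stem)
    if match is None:
        return None
    return PageId(int(match["year"]), int(match["volume"]), int(match["page"]))


def keep_pages(filenames: Iterable[str], extraneous: Set[PageId]) -> List[str]:
    """Drop unparseable names and pages flagged as extraneous in manual review."""
    kept = []
    for name in filenames:
        page_id = parse_page_filename(name)
        if page_id is not None and page_id not in extraneous:
            kept.append(name)
    return sorted(kept)
```

For example, `page_filename(1901, 1, 42, "jpg")` yields `1901_v01_p0042.jpg`, and the matching text file would share the same stem with a `.txt` extension. Zero-padding keeps a plain alphabetical sort consistent with page order, which makes the manual review pass easier to follow.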
Project Outputs:
- Data Acquisition and Corpus Creation: Request and prepare digital scans of Laws of Acts of Assembly of Virginia from HathiTrust and UVA Law Library, split and process image files, remove paratextual information and isolate relevant text, and record metadata.
- OCR Preparation and Execution: Optimize images, perform OCR, and refine text extraction using Python’s Pillow library and the Tesseract OCR engine.
- Text Analysis: Utilize unsupervised and supervised machine learning techniques, including Latent Dirichlet Allocation (LDA) for topic modeling and classification of Jim Crow laws, drawing from various historical sources for training data.
- Corpus Dissemination and Outreach: Share corpora via open access platforms such as LibraData, UVA Library’s instance of Dataverse; provide detailed documentation; and host workshops to introduce and support research efforts.