Intro blurb Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque enim tortor, blandit vitae, dignissim sed, lobortis vitae, quam. Donec in enim. Curabitur vitae orci. Pellentesque lobortis arcu sit amet turpis. Sed sit amet massa sit amet mi condimentum porta. Cras urna risus, nonummy sed, ultricies ut, mollis vitae, lorem. Fusce ultrices metus eu tellus. Vestibulum ac risus a augue tincidunt tempor. In hac habitasse platea dictumst. Morbi varius. Donec eget nunc. Nam vel felis sit amet elit pulvinar molestie. Proin purus justo, vulputate sit amet, pretium vitae, dignissim at, ipsum. Mauris mattis mauris quis tellus.
Primary data
Description of the data received from the client:
- single excel file:
DHMD Indiana University Iconclass Final EN.xls
- one sheet:
_tmpVC
- 7700 records + 1 header row = 7701 rows
Data headings
heading
(unique values / values present)
Inventar Nr
(3306 / 7700)Obj_ Id
(3306 / 7700)Title
(2930 / 7700)Land
(134 / 7700)Language
(106 / 7627)Jahr Von Datierung
(29 / 7155)Datierung: Jahr Bis Datierung
(27 / 3866)Notation Iconclass DE
(648 / 7005)Notation Iconclass EN
(637 / 7005)Notation Iconclass DE
converted to English.Processing of the primary data
Translation and renaming of headers
For convenience and consistency, the following column names will be used in place of the original headings.
column name | original heading |
---|---|
inventory |
Inventar Nr |
id |
Obj_ Id |
title |
Title |
origin |
Land |
languages |
Language |
year_from |
Jahr Von Datierung |
year_to_ |
Datierung: Jahr Bis Datierung |
ic_en_ |
Notation Iconclass EN |
Determine data requirements
The analysis to be completed has the following record requirements:
- a unique poster identifier:
id
- a location for any geospatial analysis:
origin
- a date for temporal analysis:
year_from
- at least one valid code for topical analysis:
ic_en
Audit
A number of records were removed or modified to conform with the requirements. Additionally, some assumptions and substitutions were made to allow exploring the data further. The list below summarizes the changes.
year_to = year_to_ or year_from
Assume one year in-use whereyear_to_
is not provided (3289 records updated)language_obsv
table of normalized language observations into discrete records (id, language), remove "?" (3547 records created)- normalize
origin
, merge variants (127 distinct records) ic_obsv
table of Iconclass observations; created to reduced poster record duplication and errors (6993 valid records)ic_codes
table of distinct Iconclass codes; created to manage all distinct codes (636 records ), with the follow columns:ic_en_
: english notation as received, lowercasedic_en
: controlled, valid list, with corrections or replacements (see table below)ic_stem
: stemmed alphanumeric code portion onlyic_exp
: expansion code of the notation, if anyic_text
: text descriptor for the notation, if anyic_text_tokens
: tokenized version ofic_text
ic_parent
: the parent Iconclass codeic_depth
: the depth of the current code in the Iconclass hierarchyic_d1
...ic_9
: ancestors of the current code for depths 1 through 9ic_d_tokens
: pipe delimited list ofic_d1 - 9
- remove anomalies: "kellner"
- replace or correct invalid/incomplete notations (see table below)
notation | correction |
---|---|
jeans | 41D231 trousers |
red ribbon | 41D2653 ribbon |
green ribbon | 41D2653 ribbon |
hiv | 31A4623 contagious diseases |
condom | 42A193 birth control * |
* Given the subject matter, "birth control" is rarely the intended function of the condoms. However, it is sufficiently distinct from all other Iconclass codes so as to produce results equivalent to the case where a more appropriate code existed.
Summary of validated and normalized data
- 2447 posters
- 6993 Iconclass observations (
ic_obsv
) - 636 Iconclass codes used (
ic_codes
) - 3547 language observations (
language_obsv
) - 61 languages used
- 127 distinct origins
- 29 distinct years, none from 2011
Secondary data
Google Translate (via Google Sheets)
Google Sheets has implemented a formula for translating text. The following process was used to obtain approximate translations of the poster titles:
- upload a CSV of distinct titles
- B2 = GOOGLETRANSLATE(A2,"auto","en"), copy down
- export CSV download
- strip out line breaks
- strip out double quotes
Arkyves.org
The client provided the following link: Arkyves.org AIDS Poster Collection. Using a Python script, URLs were scraped for the thumbnail images and the associated DHMD collection pages. The collection page URL matches the poster id, allowing it and the thumbnail images to be joined to the primary data.
Country Centroids
county_centroids_all.csv
- obtained from Gothos
- A zip file containing tab delimited text files with country centroids in latitude and longitude and FIPS and ISO codes (two-letter alpha) for the world’s countries and territories. Includes metadata. Extracted from the NGA GNS and the USGS GNIS in Feb 2012.
Country and region boundaries
- obtained from NaturalEarthData.com
- Admin 0 - Countries
- contains FIPS10 codes
- in QGIS join country_centroids_all.csv
- dissolving by region to create region boundaries
Composite data
The primary and secondary data were imported into an SQLite database. Normalization tasks were performed either before the import, or immediately after, depending on the capabilities of SQLite. Specifically, expanding aggregated values into discrete records is not possible as of SQLite 3.x.
Tables produced from the primary data:
import (inventory, id, title, origin, languages, year_from, year_to_, ic_de, id_en)
ic_obsv (id, ic_en_)
ic_codes (ic_en_, ic_en, ic_stem, ic_exp, ic_text, ic_text_tokens, ic_parent, ic_depth, ic_d1...ic_d9, ic_d_tokens)
language_obsv (id, language)
Tables produced from country_centroids_all.csv
:
origin_names (origin, country, fips10, region)
(136 records). Each distinct poster origin was matched to a country.country_centroids (LAT,LONG,…,FIPS10,SHORT_NAME,…)
(257 records)
Table produced from Google Sheets/Translate:
translate_title (title, title_en)
(2954 records)
Table produced from arkyve.org scrape:
poster_urls (id, url, img)
(2600 records)
Additional composite tables created for convenience:
posters (inventory, id, title, title_en, origin, year_from, year_to, url, img)
(2447 records)
Exports (CSV)
The purpose of using an SQLite database is to produce a variety of CSV tables suitable for other software, while minimizing duplication, errors, and repetition. As such, once the data has been normalized, only basic join operations are needed.
ic_stats
pivot_prep...
ic_d1
over year_from
.burst_prep...
net_prep...
node_data
Next >> Research