Intro blurb Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque enim tortor, blandit vitae, dignissim sed, lobortis vitae, quam. Donec in enim. Curabitur vitae orci. Pellentesque lobortis arcu sit amet turpis. Sed sit amet massa sit amet mi condimentum porta. Cras urna risus, nonummy sed, ultricies ut, mollis vitae, lorem. Fusce ultrices metus eu tellus. Vestibulum ac risus a augue tincidunt tempor. In hac habitasse platea dictumst. Morbi varius. Donec eget nunc. Nam vel felis sit amet elit pulvinar molestie. Proin purus justo, vulputate sit amet, pretium vitae, dignissim at, ipsum. Mauris mattis mauris quis tellus.

Primary data

Description of the data received from the client:

  • single excel file: DHMD Indiana University Iconclass Final EN.xls
  • one sheet: _tmpVC
  • 7700 records + 1 header row = 7701 rows

Data headings

heading (unique values / values present)

Inventar Nr (3306 / 7700)
Inventory number for the poster in the DHMD collection.
Obj_ Id (3306 / 7700)
Poster identifier. Can be used to aggregate all IC observations for each poster. However, further inspection of the titles and thumbnail images has exposed duplicates (ie: two distinct id values that reference the same poster design). Perhaps because there are two physical copies in the DHMD collect. In the interests of time, this issue will be addressed at a later date, if at all.
Title (2930 / 7700)
The title, caption, or primary phrase of the poster in the original language, where romanization is possible. Alternatively, hard brackets enclosed a description of the script. Presumably this was a technological limitation (eg, data collection input method, or database encoding). As noted above, some duplicates poster designs exists which - in part - explains why there are fewer titles than IDs. Another explanation would be multilple posters using the same phrase, accidentally or as a part of a larger ad.
Land (134 / 7700)
The geographic origin of the poster. Field values are not controlled by vocabulary or level of detail. Multiple values exist for the same intended place. Some locations are countries, while others are states or provinces.
Language (106 / 7627)
A comma separated list of languages used on the poster. The list is not sorted by any known function - either alphabetic or by the quantity of text on the poster. Therefor, multiple permutations exist: "English, Spanish" and "Spanish, English". This field will need to be normalized into discrete observations. Some values contain "?" to indicate uncertainty. This will be stripped out and assumed accurate, or ignored.
Jahr Von Datierung (29 / 7155)
The year the poster was published.
Datierung: Jahr Bis Datierung (27 / 3866)
The year up to which the poster was in use.
Notation Iconclass DE (648 / 7005)
Iconclass notation observation. In German, the original language with which the data was collected. A quick review of the codes indicates that some are invalid or incomplete (AIDS, Condom, Jeans).
Notation Iconclass EN (637 / 7005)
Notation Iconclass DE converted to English.

Processing of the primary data

Translation and renaming of headers

For convenience and consistency, the following column names will be used in place of the original headings.

column name original heading
inventory Inventar Nr
id Obj_ Id
title Title
origin Land
languages Language
year_from Jahr Von Datierung
year_to_ Datierung: Jahr Bis Datierung
ic_en_ Notation Iconclass EN


Determine data requirements

The analysis to be completed has the following record requirements:

  • a unique poster identifier: id
  • a location for any geospatial analysis: origin
  • a date for temporal analysis: year_from
  • at least one valid code for topical analysis: ic_en

Audit

A number of records were removed or modified to conform with the requirements. Additionally, some assumptions and substitutions were made to allow exploring the data further. The list below summarizes the changes.

  • year_to = year_to_ or year_from Assume one year in-use where year_to_ is not provided (3289 records updated)
  • language_obsv table of normalized language observations into discrete records (id, language), remove "?" (3547 records created)
  • normalize origin, merge variants (127 distinct records)
  • ic_obsv table of Iconclass observations; created to reduced poster record duplication and errors (6993 valid records)
  • ic_codes table of distinct Iconclass codes; created to manage all distinct codes (636 records ), with the follow columns:
    • ic_en_ : english notation as received, lowercased
    • ic_en : controlled, valid list, with corrections or replacements (see table below)
    • ic_stem : stemmed alphanumeric code portion only
    • ic_exp : expansion code of the notation, if any
    • ic_text : text descriptor for the notation, if any
    • ic_text_tokens : tokenized version of ic_text
    • ic_parent : the parent Iconclass code
    • ic_depth : the depth of the current code in the Iconclass hierarchy
    • ic_d1 ... ic_9 : ancestors of the current code for depths 1 through 9
    • ic_d_tokens : pipe delimited list of ic_d1 - 9
    • remove anomalies: "kellner"
    • replace or correct invalid/incomplete notations (see table below)
notation correction
jeans 41D231 trousers
red ribbon 41D2653 ribbon
green ribbon 41D2653 ribbon
hiv 31A4623 contagious diseases
condom 42A193 birth control *

* Given the subject matter, "birth control" is rarely the intended function of the condoms. However, it is sufficiently distinct from all other Iconclass codes so as to produce results equivalent to the case where a more appropriate code existed.


Summary of validated and normalized data

  • 2447 posters
  • 6993 Iconclass observations (ic_obsv)
  • 636 Iconclass codes used (ic_codes)
  • 3547 language observations (language_obsv)
  • 61 languages used
  • 127 distinct origins
  • 29 distinct years, none from 2011

Secondary data

Google Translate (via Google Sheets)

Google Sheets has implemented a formula for translating text. The following process was used to obtain approximate translations of the poster titles:

  • upload a CSV of distinct titles
  • B2 = GOOGLETRANSLATE(A2,"auto","en"), copy down
  • export CSV download
  • strip out line breaks
  • strip out double quotes

Arkyves.org

The client provided the following link: Arkyves.org AIDS Poster Collection. Using a Python script, URLs were scraped for the thumbnail images and the associated DHMD collection pages. The collection page URL matches the poster id, allowing it and the thumbnail images to be joined to the primary data.

Country Centroids

county_centroids_all.csv

  • obtained from Gothos
  • A zip file containing tab delimited text files with country centroids in latitude and longitude and FIPS and ISO codes (two-letter alpha) for the world’s countries and territories. Includes metadata. Extracted from the NGA GNS and the USGS GNIS in Feb 2012.

Country and region boundaries

  • obtained from NaturalEarthData.com
  • Admin 0 - Countries
  • contains FIPS10 codes
  • in QGIS join country_centroids_all.csv
  • dissolving by region to create region boundaries

Composite data

The primary and secondary data were imported into an SQLite database. Normalization tasks were performed either before the import, or immediately after, depending on the capabilities of SQLite. Specifically, expanding aggregated values into discrete records is not possible as of SQLite 3.x.

Tables produced from the primary data:

  • import (inventory, id, title, origin, languages, year_from, year_to_, ic_de, id_en)
  • ic_obsv (id, ic_en_)
  • ic_codes (ic_en_, ic_en, ic_stem, ic_exp, ic_text, ic_text_tokens, ic_parent, ic_depth, ic_d1...ic_d9, ic_d_tokens)
  • language_obsv (id, language)

Tables produced from country_centroids_all.csv:

  • origin_names (origin, country, fips10, region) (136 records). Each distinct poster origin was matched to a country.
  • country_centroids (LAT,LONG,…,FIPS10,SHORT_NAME,…) (257 records)

Table produced from Google Sheets/Translate:

  • translate_title (title, title_en) (2954 records)

Table produced from arkyve.org scrape:

  • poster_urls (id, url, img) (2600 records)

Additional composite tables created for convenience:

  • posters (inventory, id, title, title_en, origin, year_from, year_to, url, img) (2447 records)

 

Exports (CSV)

The purpose of using an SQLite database is to produce a variety of CSV tables suitable for other software, while minimizing duplication, errors, and repetition. As such, once the data has been normalized, only basic join operations are needed.

ic_stats
A table for exploring possible trends and patterns with LibreOffice Calc charts, or similar. The table subject is Iconclass codes and their frequency, average weight, min/max/avg year_from, and year span
pivot_prep...
Multiple tables produced that are suitable for pivot tables and charts. For example, stacked area graphs of ic_d1 over year_from.
burst_prep...
A series of tables for importing into Sci2 for burst analysis, and temporal bar graphs.
net_prep...
A series of tables for importing into Sci2 and extracting networks.
node_data
This table was needed to join poster data to the Gephi node ids.

 


Next >> Research