Data
Article Corpus and Data Model
This page describes the data used in this project. It documents the workflow and the content it produces, and presents the data model for the resulting corpus and other data outputs.
Article Corpus
The corpus is based on the bibliography provided by Ruedin and Hanak (2008, 216–20). I did not include all publications in the data set but focused on the articles published in newspapers and professional journals. This focus leaves out ‹gray literature› – mainly technical documents, planning reports, and printed conference proceedings – and concentrates on texts that were intended for a wider audience and can be understood as part of relatively regular publishing activity.
The 189 articles by Hans Marti that were considered (see footnote 1) for creating the corpus are listed in Table 1. All of them are accessible via e-periodica.ch or e-newspaperarchives.ch.
Workflow
A multi-step workflow (Figure 1) is used to scrape the articles from the web. After some cleaning, the text files are of sufficient quality for topic modelling with quanteda (Benoit et al. 2018). The topic models and the results of the analysis are summarised in a report which is rendered with Quarto and deployed to GitHub Pages. The repository as a whole is archived on Zenodo.
%%{ init: { 'themeVariables': { 'edgeLabelBackground': '#fcfcfc'} } }%%
flowchart TD
subgraph Resources
RN2[("e-periodica.ch<br>e-newspaperarchives.ch")] -->|Article IDs| RN13
RN1[(Literature Research)] -->|Bibliography| RN13
RN1 -->|Biographical Context| RN13
end
CB1 -->|Report| RN9
RN14 -->|*.txt<br>Files| CB2
subgraph Data Storage
RN9(["modelling-marti <br>(GitHub Repository)"])
RN9 -->|Quarto| RN11([Online<br>Documentation])
RN9 --> RN10([Zenodo])
RN5([Zotero Bibliography])
end
subgraph Analysis
RN14 -->|Processed<br>Article Texts| CB1["Analysis<br>(Quanteda)"]
CB2([Corpus]) --> RN9
RN2 -.->|Download| RN14
end
RN13([docs/articles_metadata.csv]) --> RN9
RN1 ------> RN5
RN13 --> RN14([Full Articles])
style CB1 fill:#90C987
style CB2 fill:#7BAFDE
style RN1 fill:#D9CCE3
style RN2 fill:#D9CCE3
style RN5 fill:#F6C141
style RN9 fill:#F7F056
style RN10 fill:#F7F056
style RN11 fill:#F6C141
style RN13 fill:#7BAFDE
style RN14 fill:#7BAFDE
Data Model
Article Data
All data is available in CSV format. Metadata for each CSV file is provided in a JSON file following the W3C Metadata Vocabulary for Tabular Data (W3C 2022). In the R workflow, the metadata object is used to create a list of articles (articles), which includes a URL for each article. The texts behind these URLs are downloaded into another R object, text_data. The resulting marti_corpus contains the full articles, metadata on those articles and information on Marti’s work status at the time of publication for each article (Figure 2).
classDiagram
direction LR
metadata <|-- articles
articles <.. marti_corpus
text_data <.. marti_corpus
raw_text "1..*" o-- text_data
berufslaufbahn <|-- marti_corpus
class metadata {
id : str
title : str
publication : str
date : str (%d.%m.%y)
language: str (ISO 639-1)
standalone: bool
first_row: int
last_row: int
archive_id: str
note: str
}
class articles {
id : str
title : str
publication : str
date : str (%d.%m.%y)
language: str (ISO 639-1)
standalone: bool
first_row: int
last_row: int
archive_id: str
note: str
url: str
}
class text_data {
doc_id: str
text: str
}
class raw_text {
*.txt (article texts)
}
class berufslaufbahn {
Beruf : str
Start : str (%Y-%m-%d)
Ende : str (%Y-%m-%d)
}
class marti_corpus {
doc_id: str
text: str
title: str
publication: str
date : str (%Y-%m-%d)
language: str (ISO 639-1)
VLP: bool
SBZ: bool
Delegierter: bool
Gemeinderat: bool
Pensionierung: bool
fachpublikum: bool
pol_mandat: bool
}
style marti_corpus fill:#F7CB45
style text_data fill:#FFEAAE
style articles fill:#D1BBD7
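The relationship between articles, text_data and marti_corpus in Figure 2 can be read as a join on the article identifier (id in articles, doc_id in text_data). The following is a minimal Python sketch of that idea, not the project's R code. The function name `build_corpus`, the French sample row and the decision to drop non-German articles at this particular step are assumptions made for illustration.

```python
# Illustrative sketch only; the project implements these steps in R.
# Field names follow the data model; build_corpus is a hypothetical helper.

def build_corpus(articles, text_data):
    """Join article metadata with full texts on id == doc_id."""
    texts = {row["doc_id"]: row["text"] for row in text_data}
    corpus = []
    for art in articles:
        # Assumption: non-German articles are dropped here (see footnote 1).
        if art["language"] != "de":
            continue
        # Assumption: articles without a retrieved text are skipped.
        if art["id"] not in texts:
            continue
        corpus.append({
            "doc_id": art["id"],
            "text": texts[art["id"]],
            "title": art["title"],
            "publication": art["publication"],
            "language": art["language"],
        })
    return corpus


articles = [
    # Row taken from the example values on this page.
    {"id": "NZZ_19550205_0010", "title": "Der Traum von der neuen Stadt",
     "publication": "NZZ", "language": "de"},
    # Made-up row to show the language filter.
    {"id": "hypothetical_fr_0001", "title": "(made-up title)",
     "publication": "(made-up)", "language": "fr"},
]
text_data = [
    {"doc_id": "NZZ_19550205_0010", "text": "Auf die in der umstrittenen Basler…"},
]

corpus = build_corpus(articles, text_data)
print(len(corpus), corpus[0]["doc_id"])
```

The career-related boolean columns of marti_corpus are left out here because they depend on the berufslaufbahn data described further down.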
Corpus Metadata
The marti_corpus object is stored as a CSV file in this repository, accompanied by a JSON metadata file. The content of the corpus is described below.
| variable | datatype | description | example value |
|---|---|---|---|
| doc_id | string | unique identifier for each article | NZZ_19550205_0010 |
| text | string | full processed text body as plain text | Auf die in der umstrittenen Basler… |
| title | string | title of the article as published | Der Traum von der neuen Stadt |
| publication | string | name of the journal/newspaper | NZZ |
| date | string | date of publication, %Y-%m-%d format | 1955-02-05 |
| language | string | main language the article is written in, ISO 639-1 code | de |
| VLP | boolean | Marti worked at the VLP at the time of publication | FALSE |
| SBZ | boolean | Marti worked at the Bauzeitung at the time of publication | TRUE |
| Delegierter | boolean | Marti worked as Delegate of the City Government at the time of publication | FALSE |
| Gemeinderat | boolean | Marti was a member of the City Parliament at the time of publication | FALSE |
| Pensionierung | boolean | Marti was retired at the time of publication | FALSE |
| fachpublikum | boolean | architecture/planning professionals are the main audience of the publication | FALSE |
| pol_mandat | boolean | Marti held a public office at the time of publication | FALSE |
All values result from the processing workflow and are taken from one of the data objects created in earlier steps of the R workflow. These objects are described in more detail in the following sections.
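To illustrate what a metadata file following the W3C Metadata Vocabulary for Tabular Data can look like for the corpus table above, here is a minimal Python sketch that assembles such a description and serialises it to JSON. The file name `marti_corpus.csv`, the choice of `doc_id` as primary key and the typed `date` column are assumptions; the metadata file in the repository may be structured differently.

```python
import json

# Hypothetical file name; the actual name in the repository may differ.
csv_name = "marti_corpus.csv"

columns = [
    {"name": "doc_id", "datatype": "string", "required": True},
    {"name": "text", "datatype": "string"},
    {"name": "title", "datatype": "string"},
    {"name": "publication", "datatype": "string"},
    # Assumption: declared as a typed date; the table lists it as a string.
    {"name": "date", "datatype": {"base": "date", "format": "yyyy-MM-dd"}},
    {"name": "language", "datatype": "string"},
]

flag_names = ["VLP", "SBZ", "Delegierter", "Gemeinderat",
              "Pensionierung", "fachpublikum", "pol_mandat"]
# The CSV stores booleans as TRUE/FALSE, so the format is spelled out.
columns += [
    {"name": name, "datatype": {"base": "boolean", "format": "TRUE|FALSE"}}
    for name in flag_names
]

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": csv_name,
    "tableSchema": {"columns": columns, "primaryKey": "doc_id"},
}

metadata_json = json.dumps(metadata, indent=2, ensure_ascii=False)
print(metadata_json)
```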
Article Metadata
metadata
The metadata object in the R workflow is a data frame containing the articles’ metadata, mostly taken from the bibliography provided in Ruedin and Hanak (2008, 216–20), enhanced by extensive research on Marti and his work. The metadata object is stored as a CSV file in this repository.
| variable | datatype | description | example value |
|---|---|---|---|
| id | string | unique identifier for each article | NZZ_19550205_0010 |
| title | string | title of the article as published | Der Traum von der neuen Stadt |
| publication | string | name of the journal/newspaper | NZZ |
| date | string | date of publication as taken from scanned documents, %d.%m.%y format | 05.02.55 |
| language | string | main language the article is written in, ISO 639-1 code | de |
| standalone | boolean | complete article in one item (as opposed to publications spread over multiple issues) | TRUE |
| first_row | integer | number of first row in *.txt file containing actual content (i.e., not a headline) | 3 |
| last_row | integer | number of last row in *.txt file containing actual content (i.e., end of article) | 27 |
| archive_id | string | identifier used in permalinks at e-newspaperarchives.ch | NZZ19550205-01.2.30.2 |
| note | string | notes regarding the transcription or the article itself | Bildbeschreibung in den Text gerutscht |
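The metadata stores dates with a two-digit year (%d.%m.%y), while the corpus uses %Y-%m-%d. Two-digit years are a common pitfall: under the POSIX convention, which Python's `strptime` follows, the years 00 to 68 are read as 2000 to 2068, so `05.02.55` would become 2055. The following Python sketch shows one way to handle this. It assumes that all articles were published in the 20th century, and the helper name `to_iso_date` is hypothetical; the R workflow may solve this differently.

```python
from datetime import datetime


def to_iso_date(raw, latest_year=1999):
    """Convert 'dd.mm.yy' to 'YYYY-MM-DD'.

    Assumption: no article was published after `latest_year`, so any
    parsed year beyond it is shifted back by one century.
    """
    parsed = datetime.strptime(raw, "%d.%m.%y")
    if parsed.year > latest_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.strftime("%Y-%m-%d")


print(to_iso_date("05.02.55"))
```

The same check is worth running on the converted dates in any language, for example by asserting that no publication date lies in the future.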
articles
The articles object in the R workflow is a data frame based on the metadata data frame. It additionally contains a URL pointing to the full text of each article on either e-periodica.ch or e-newspaperarchives.ch.
| variable | datatype | description | example value |
|---|---|---|---|
| id | string | unique identifier for each article | NZZ_19550205_0010 |
| title | string | title of the article as published | Der Traum von der neuen Stadt |
| publication | string | name of the journal/newspaper | NZZ |
| date | string | date of publication as taken from scanned documents, %d.%m.%y format | 05.02.55 |
| language | string | main language the article is written in, ISO 639-1 code | de |
| standalone | boolean | complete article in one item (as opposed to publications spread over multiple issues) | TRUE |
| first_row | integer | number of first row in *.txt file containing actual content (i.e., not a headline) | 3 |
| last_row | integer | number of last row in *.txt file containing actual content (i.e., end of article) | 27 |
| archive_id | string | identifier used in permalinks at e-newspaperarchives.ch | NZZ19550205-01.2.30.2 |
| note | string | notes regarding the transcription or the article itself | Bildbeschreibung in den Text gerutscht |
| url | string | URL to access the full article (raw text or PDF) | https://... |
Article Text
text_data
The text_data object in the R workflow is a data frame containing all the articles’ full texts after retrieval from either e-periodica.ch or e-newspaperarchives.ch and subsequent semi-automatic cleaning.
| variable | datatype | description | example value |
|---|---|---|---|
| doc_id | string | unique identifier for each article | jub-002_1964_26__231_d |
| text | string | full processed text body as plain text | Im Jahre 1947 wurde in Baden… |
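The first_row and last_row variables in the metadata mark where the actual article content starts and ends in a *.txt file. As a rough Python sketch of how such markers can be applied, the function below keeps only the rows in that range. It assumes that rows are counted from 1 and that both bounds are inclusive; the function name `extract_body` and the sample text are made up, and the project's semi-automatic cleaning involves more than this single step.

```python
def extract_body(raw_text, first_row, last_row):
    """Keep rows first_row..last_row (assumed 1-based, inclusive).

    Empty rows are dropped and the remaining rows are joined with spaces.
    """
    rows = raw_text.splitlines()
    body = rows[first_row - 1:last_row]
    return " ".join(row.strip() for row in body if row.strip())


# Made-up sample file content: two header rows, two content rows, one footer.
sample = "HEADLINE\nSubtitle\nFirst sentence.\nSecond sentence.\nFooter"
body = extract_body(sample, first_row=3, last_row=4)
print(body)
```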
Professional Career
Selected events on a timeline of Hans Marti’s professional career are stored in the corpus and used as covariates for topic modelling. The dates are stored in a CSV file accompanied by a JSON metadata file. The data is based on information by Lendi (2018), Böcker (2007) and Koll-Schretzenmayr (2008).
| variable | datatype | description | example value |
|---|---|---|---|
| Beruf | string | unique identifier for each job | VLP |
| Start | string | start date of job, %Y-%m-%d format | 1945-01-01 |
| Ende | string | end date of job, %Y-%m-%d format | 1948-01-01 |
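The boolean career columns in the corpus (VLP, SBZ and so on) can be understood as the result of checking each article's publication date against these start and end dates. The Python sketch below illustrates that interval check. It is not the project's R code: the helper name `career_flags` is hypothetical, only the VLP row is taken from the table above, and treating the start date as inclusive and the end date as exclusive is an assumption.

```python
from datetime import date

# Only this row is taken from the example values in the table.
career = [
    {"Beruf": "VLP", "Start": "1945-01-01", "Ende": "1948-01-01"},
]


def career_flags(pub_date, career_rows):
    """Return {Beruf: bool} for a publication date in %Y-%m-%d format."""
    pub = date.fromisoformat(pub_date)
    flags = {}
    for row in career_rows:
        start = date.fromisoformat(row["Start"])
        end = date.fromisoformat(row["Ende"])
        # Assumption: start is inclusive, end is exclusive.
        active = start <= pub < end
        # A job may appear in several rows; any active period sets the flag.
        flags[row["Beruf"]] = flags.get(row["Beruf"], False) or active
    return flags


print(career_flags("1955-02-05", career))
```

For the example article from 1955 this yields a false VLP flag, which matches the example value shown in the corpus table.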
Geodata
For visualisation purposes, this project uses geodata to create a project map as part of the analysis. The map shows each project that Marti and/or his office were involved in. Information on the projects is taken from Ruedin and Hanak (2008, 222–24) and stored in data/geodata/marti-geodata-json.geojson. To print a static map in the PDF version of the analysis, I use a topographic map of Switzerland provided by swisstopo (2025), which is clipped using boundary data from OpenStreetMap (data/geodata/CH_outline.geojson). See Figure 3 for an overview of the structure of the GeoJSON files.
classDiagram
direction TD
class marti-geodata-json {
id: int
project: str
period: str
location: str
coordinates: str (EPSG:2056/CH1903+)
}
class CH_outline {
type: str
coordinates: str (CRS84)
}
Planning Project Inventory
The list of projects by Marti and his office is available as a GeoJSON file. The coordinates assigned to each project do not correspond to any actual planning activities but mark a generic point within each project perimeter. The values for the other variables are taken from the project inventory by Ruedin and Hanak (2008, 222–24). The dataset uses Swiss Landeskoordinaten (EPSG:2056, CH1903+/LV95). A QGIS metadata file is available on GitHub alongside the GeoJSON file itself.
| variable | datatype | description | example value |
|---|---|---|---|
| id | integer | unique identifier for each project | 104 |
| project | string | project title/description (in German) | Ortsplanung, Verkehrsplanung, Beratung Nationalstrassenführung, Quartierpläne |
| period | string | timeframe of project duration | 1960–1975 |
| location | string | name of location, mostly municipalities | Murten |
| coordinates | string | E/N coordinates of point geometry, EPSG:2056 (Swiss CH1903+/LV95 format) | 2575516.085672613698989, 1197518.942096951650456 |
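Because the project inventory uses LV95 coordinates while the country outline uses WGS 84, it can be useful to sanity-check a project point in longitude/latitude. The Python sketch below parses a coordinate string in the form shown in the table and applies the approximate conversion formulas published by swisstopo, which are accurate to roughly a metre. The function names are hypothetical, and for actual map production a proper reprojection in a GIS library is the more robust choice.

```python
def parse_en(coordinates):
    """Split an 'E, N' string (as in the table above) into two floats."""
    east, north = (float(part) for part in coordinates.split(","))
    return east, north


def lv95_to_wgs84(east, north):
    """Approximate LV95 (EPSG:2056) to WGS 84 conversion.

    Uses swisstopo's approximate series; returns (longitude, latitude)
    in decimal degrees.
    """
    y = (east - 2_600_000) / 1_000_000
    x = (north - 1_200_000) / 1_000_000
    lon = (2.6779094 + 4.728982 * y + 0.791484 * y * x
           + 0.1306 * y * x**2 - 0.0436 * y**3)
    lat = (16.9023892 + 3.238272 * x - 0.270978 * y**2
           - 0.002528 * x**2 - 0.0447 * y**2 * x - 0.0140 * x**3)
    # The series yields units of 10000 arc seconds; convert to degrees.
    return lon * 100 / 36, lat * 100 / 36


east, north = parse_en("2575516.085672613698989, 1197518.942096951650456")
lon, lat = lv95_to_wgs84(east, north)
print(round(lon, 3), round(lat, 3))
```

The example value from the table (Murten) should land at roughly 7.12° E, 46.93° N, which gives a quick plausibility check for the stored coordinates.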
Map of Switzerland
The map of Switzerland used in the PDF version of the report is produced from OpenStreetMap geodata (OpenStreetMap Contributors 2025), more specifically Relation #51701 v545. This data (Swiss border geometry) is available in a GeoJSON file. The dataset uses WGS 84 coordinates (OGC:CRS84).
| variable | datatype | description | example value |
|---|---|---|---|
| type | string | feature geometry type | MultiPolygon |
| coordinates | string | E/N coordinates of point geometries, OGC:CRS84 (WGS 84 format) | 9.1572929, 47.6659088 |
References
Footnotes
1. All non-German articles were removed before executing the topic modelling workflow.