🇨🇷 In 2018, Guido Jimenez started the Asamblea Abierta project to parse the records of congress sessions of the Costa Rican National Assembly.
Our goal was to generate structured data, stored as XML, from unstructured PDFs, enabling detailed analysis such as how many times a given congressman spoke, for how long, and whether they referenced other congressmen.
Although that project never saw the light of day, this contribution showcases the power of CLI-based tools and how they can do the heavy lifting in complex data pipelines.
My fork, specifically, provides a pre-processing script, with a view to eventually building a full-blown ETL. It can be found at https://github.com/esmitperez/asamblea_abierta.
The Challenge
The file format they publish, although not formally structured, DOES have a certain human-interpretable structure: a table of contents, numbering, formatting hints around interventions, and so on.
The extraer_xml.sh script (https://github.com/esmitperez/asamblea_abierta/blob/master/extraer_xml.sh) relies on the formatting available in the unstructured document and does the following (a rough sketch follows the list):
- Convert from .pdf to .txt
- Extract the “Roll call” table of congressmen/congresswomen
- Convert it to .xml
- Do some basic analysis, like determining quorum
- Mark interventions by congressperson or the congress president
- Construct a final XML with all the extracted data and metadata.
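For illustration, here is a minimal sketch of what such a pipeline can look like when built from standard CLI tools. It is not the actual extraer_xml.sh: pdftotext (from poppler-utils), the section markers, file paths, and sed patterns below are all assumptions standing in for the real formatting hints the script relies on.

# Minimal illustrative sketch, NOT the real extraer_xml.sh.
# Assumes pdftotext (poppler-utils); markers, paths and patterns are hypothetical.
ACTA="2018-2019-PLENARIO-SESION-1"

# 1. Convert the session record from .pdf to .txt, preserving layout
pdftotext -layout "actas_pdf/${ACTA}.pdf" "actas_txt/${ACTA}.txt"

# 2. Extract the roll-call section between two hypothetical section headings
sed -n '/Asistencia/,/Orden del día/p' "actas_txt/${ACTA}.txt" > "tmp/${ACTA}_asistencia.txt"

# 3. Wrap each roll-call line as XML (assumes lines look like "Apellidos, Nombre")
{
  echo "<atendencia>"
  grep ',' "tmp/${ACTA}_asistencia.txt" | \
    sed -E 's|^(.+), (.+)$|  <diputado><apellidos>\1</apellidos><nombre>\2</nombre></diputado>|'
  echo "</atendencia>"
} > "tmp/${ACTA}_asistencia.xml"

# 4. Basic analysis: count attendees to check quorum
xmllint --xpath 'count(/atendencia/diputado)' "tmp/${ACTA}_asistencia.xml"

The real script follows the same shape, but keys off the actual formatting cues in the published documents rather than these placeholder patterns.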
Once the .xml file is created, simple queries can be run against it using XPath expressions:
# Extract Quorum
$ xmllint --xpath 'count(/acta/atendencia/diputado/apellidos)' \
actas_xml/2018-2019-PLENARIO-SESION-1.xml
# How many times the (non-interim) congress president spoke
$ xmllint --xpath 'count(/acta/fragmento[@tipo="presidente" and @interino="false"])' \
actas_xml/2018-2019-PLENARIO-SESION-1.xml
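Only the "presidente" fragment type is confirmed above; assuming interventions by individual congresspeople are tagged analogously (for example with tipo="diputado", which is an assumption here), a similar query would count them:

# Interventions by congresspeople (hypothetical tipo="diputado" attribute)
$ xmllint --xpath 'count(/acta/fragmento[@tipo="diputado"])' \
actas_xml/2018-2019-PLENARIO-SESION-1.xml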