Our goal was to generate structured data, stored as XML, from unstructured PDFs, enabling detailed analysis such as how many times a certain congressman spoke, for how long, and whether they referred to other congressmen.
Although that project never saw the light of day, this contribution showcases the power of CLI-based tools and how they can be used to do heavy lifting as part of complex data pipelines.
The file format they publish, although not formally structured, does have some human-interpretable structure: a table of contents, numbering, certain formatting hints around interventions, etc.
The script at https://github.com/esmitperez/asamblea_abierta/blob/master/extraer_xml.sh relies on the formatting available in the unstructured document and does the following:
- Convert the document from PDF
- Extract the “Roll call” table of congressmen/congresswomen
- Convert it to
- Do some basic analysis, like determining quorum
- Mark interventions by congress person or congress president
- Construct a final XML with all the extracted data and metadata.
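A few of the steps above can be sketched with standard CLI tools. This is only an illustration, not the code in extraer_xml.sh: it assumes the PDF has already been converted to plain text (the source does not say with which tool), and the `ASISTENCIA` / `FIN ASISTENCIA` markers, the "Surnames, Name" layout, the `<nombre>` element, the function names and the quorum threshold are all hypothetical. The `atendencia`, `diputado`, `apellidos` and `fragmento` names are borrowed from the queries shown in this post.

```shell
#!/bin/sh
# Hypothetical excerpt of the plain text obtained from a PDF.
# The markers and the "Surnames, Name" layout are assumptions made for
# this sketch, not the format of the real documents.
sample_txt=$(mktemp)
cat > "$sample_txt" <<'EOF'
ASISTENCIA
Mora Rojas, Ana
Vega Soto, Luis
Castro Leon, Marta
FIN ASISTENCIA
EL PRESIDENTE: Se abre la sesion.
DIPUTADA MORA ROJAS: Gracias.
EOF

# Turn the roll call table into XML, one <diputado> element per line.
# Note: special characters such as & or < are not escaped in this sketch.
roll_call_to_xml() {
  awk '
    /^ASISTENCIA$/     { in_table = 1; print "<atendencia>"; next }
    /^FIN ASISTENCIA$/ { in_table = 0; print "</atendencia>"; next }
    in_table {
      split($0, parts, ", ")
      printf "  <diputado><apellidos>%s</apellidos><nombre>%s</nombre></diputado>\n", parts[1], parts[2]
    }
  ' "$1"
}

# Succeed when at least $2 members are listed in the roll call XML file $1.
has_quorum() {
  present=$(grep -c '<diputado>' "$1" || true)
  [ "$present" -ge "$2" ]
}

# Wrap each intervention by the president in a <fragmento> element.
mark_president() {
  sed -n 's|^EL PRESIDENTE: \(.*\)$|<fragmento tipo="presidente">\1</fragmento>|p' "$1"
}

roll_call_to_xml "$sample_txt" > "$sample_txt.xml"
# The threshold of 2 is arbitrary, chosen only to fit the sample data.
if has_quorum "$sample_txt.xml" 2; then
  echo "quorum reached"
fi
mark_president "$sample_txt"
```

Keeping each step as a small filter that reads a file and writes to standard output makes it easy to test the steps one at a time and to chain them in a larger pipeline.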
Once the .xml file has been created, it can be queried with simple XPath expressions:
    # Count the members present (quorum)
    $ xmllint --xpath 'count(/acta/atendencia/diputado/apellidos)' \
        actas_xml/2018-2019-PLENARIO-SESION-1.xml

    # How many times the president spoke (excluding interim presidents)
    $ xmllint --xpath 'count(/acta/fragmento[@tipo="presidente" and @interino="false"])' \
        actas_xml/2018-2019-PLENARIO-SESION-1.xml
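To try this kind of query without the real data, the sketch below builds a tiny stand-in file and runs a few counts against it. The element and attribute names are the ones used in the queries in this post, but the contents of the sample file, the `interino="true"` fragment and the `xpath_count` helper are hypothetical, and the real files may contain more structure.

```shell
#!/bin/sh
# Hypothetical miniature of a generated file; the real ones are much larger.
sample_xml=$(mktemp)
cat > "$sample_xml" <<'EOF'
<acta>
  <atendencia>
    <diputado><apellidos>Mora Rojas</apellidos></diputado>
    <diputado><apellidos>Vega Soto</apellidos></diputado>
  </atendencia>
  <fragmento tipo="presidente" interino="false">Se abre la sesion.</fragmento>
  <fragmento tipo="diputado">Gracias.</fragmento>
  <fragmento tipo="presidente" interino="true">Continuamos.</fragmento>
</acta>
EOF

# Hypothetical helper: count the nodes matched by XPath expression $1 in file $2.
xpath_count() {
  xmllint --xpath "count($1)" "$2"
}

# Only run the queries when xmllint is installed.
if command -v xmllint >/dev/null 2>&1; then
  # Members listed in the roll call
  xpath_count '/acta/atendencia/diputado' "$sample_xml"
  # Interventions by the (non-interim) president
  xpath_count '/acta/fragmento[@tipo="presidente" and @interino="false"]' "$sample_xml"
  # Interventions by an interim president
  xpath_count '/acta/fragmento[@tipo="presidente" and @interino="true"]' "$sample_xml"
fi
```

Wrapping the expression in `count(...)` inside the helper keeps each call short, at the cost of only supporting expressions that select a set of nodes.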