Fabio Alessandro Locati (Fale)'s blog

Parse multi-CSV files in Go

August 22, 2022

As happens every year, I had to pull a lot of information from various financial intermediaries and convert it into the format the local tax system wants. For this reason, I created a Go program that does this modularly: whenever I have a new financial intermediary, I create a new module that can read whatever format they produce and return a structured object in the form the application expects. This year I added a new financial intermediary with a very “interesting” CSV format.

To condense multiple kinds of data into the same file, they created a CSV file whose first column holds the name of the content (I think of it as the database table name). The second column contains either Header or Data, to distinguish whether the row holds the headers for that “table” or a row of data, and from the third column onwards it looks like a normal CSV file.

An example would be:

Cat A, Header, Col1, Col2, Col3
Cat A, Data, ValX1, ValX2, ValX3
Cat A, Data, ValY1, ValY2, ValY3
Cat B, Header, ColA, ColB
Cat B, Data, ValWA, ValWB
Cat B, Data, ValZA, ValZB

Another peculiarity is that if the last few values of a Data row are NULL, they are simply omitted. An example would be:

Cat A, Header, Col1, Col2, Col3
Cat A, Data, ValX1, ValX2
Cat A, Data, ValY1, ValY2, ValY3

Needless to say, both peculiarities are non-standard, and therefore the Go parser is very unhappy about parsing such a file. The first peculiarity does not directly conflict with what a CSV parser expects, since it concerns the meaning of the fields, which changes throughout the file, rather than the format itself. The combination of the two, however, means that a parser expecting the same number of columns on every line will run into trouble, and, needless to say, Go’s CSV parser checks exactly that.
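To make the failure concrete, here is a small sketch of mine (not part of the original program; demoFieldCount is a name I made up) that feeds the beginning of the first sample to a plain csv.Reader: as soon as the four-column Cat B header row arrives, Read returns an error wrapping csv.ErrFieldCount.

// demoFieldCount is a hypothetical helper: it reads the sample with a
// default csv.Reader and stops at the first field-count mismatch.
// It needs the encoding/csv, errors, fmt, io, and strings imports.
func demoFieldCount() {
    sample := "Cat A, Header, Col1, Col2, Col3\n" +
        "Cat A, Data, ValX1, ValX2, ValX3\n" +
        "Cat A, Data, ValY1, ValY2, ValY3\n" +
        "Cat B, Header, ColA, ColB\n"

    reader := csv.NewReader(strings.NewReader(sample))
    for {
        r, err := reader.Read()
        if err == io.EOF {
            break
        }
        if errors.Is(err, csv.ErrFieldCount) {
            // The first record fixed FieldsPerRecord at 5, so the
            // 4-field "Cat B" header row is rejected here.
            fmt.Println("parse error:", err)
            break
        }
        if err != nil {
            fmt.Println("other error:", err)
            break
        }
        fmt.Println(r)
    }
}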

Looking at the source code of the CSV parser, I noticed that it uses the CSV Reader’s public field FieldsPerRecord to check the number of columns: if FieldsPerRecord is equal to 0, the check is skipped and FieldsPerRecord gets set to the number of columns of the current record. This means that if we reset FieldsPerRecord to 0 before every read, the CSV Reader will effectively always skip the column-count check. With this knowledge, let’s create a simple function capable of parsing a CSV file with a different column count on every row.

 1func readMultiCSV(filename string) error {
 2    input, err := os.Open(filename)
 3    if err != nil {
 4        return err
 5    }
 6    defer input.Close()
 7
 8    reader := csv.NewReader(input)
 9    for {
10        r, err := reader.Read()
11        if err == io.EOF {
12            break
13        }
14        if err != nil {
15            return err
16        }
17        fmt.Println(r)
18        reader.FieldsPerRecord = 0
19    }
20    return nil
21}

This is a standard function that opens a CSV file and parses it line by line, except for line 18, which resets the CSV Reader’s FieldsPerRecord value after every read.

We now have to distinguish the Header lines from the Data lines. To do so, we will substitute line 17 of the previous example with two if statements.

 1func readMultiCSV(filename string) error {
 2    input, err := os.Open(filename)
 3    if err != nil {
 4        return err
 5    }
 6    defer input.Close()
 7
 8    reader := csv.NewReader(input)
 9    for {
10        r, err := reader.Read()
11        if err == io.EOF {
12            break
13        }
14        if err != nil {
15            return err
16        }
17        if r[1] == "Header" {
18            fmt.Printf("HEADER: %s\n", r)
19        }
20        if r[1] == "Data" {
21            fmt.Printf("DATA: %s\n", r)
22        }
23        reader.FieldsPerRecord = 0
24    }
25    return nil
26}

This code could have been written differently (with an if/else, for example), but I’m not sure I want to assume that Header and Data are the only two options, so I prefer to check both values explicitly.
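For instance, if I wanted to be loud about rows that are neither Header nor Data, a switch with a default branch would do the job. A quick sketch (processRecord is a hypothetical name, not something the final program uses; it only needs the fmt import):

// processRecord is a hypothetical variant of the two if statements above:
// a switch whose default branch surfaces row kinds other than Header and
// Data instead of silently ignoring them.
func processRecord(r []string) error {
    switch r[1] {
    case "Header":
        fmt.Printf("HEADER: %s\n", r)
    case "Data":
        fmt.Printf("DATA: %s\n", r)
    default:
        return fmt.Errorf("unexpected row kind %q", r[1])
    }
    return nil
}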

We can now change the function with the two if statements a little bit so that it returns what we need. More precisely, we are going to map each field of a Data line to its equivalent field in the Header line, producing what we could call a DataSet, with the type []map[string]string. We will then collect those DataSets in a map keyed by the value of the first column, so the resulting structure will be map[string][]map[string]string. Since this gives us a single object containing all the data in the file, we can change the signature of the function and drop the fmt.Printf calls. The resulting code will be:

 1func readMultiCSV(filename string) (map[string][]map[string]string, error) {
 2    input, err := os.Open(filename)
 3    if err != nil {
 4        return nil, err
 5    }
 6    defer input.Close()
 7
 8    reader := csv.NewReader(input)
 9    dataSets := map[string][]map[string]string{}
10    var header []string
11    var dataSet []map[string]string
12    var dataSetName string
13    for {
14        r, err := reader.Read()
15        if err == io.EOF {
16            break
17        }
18        if err != nil {
19            return nil, err
20        }
21        if r[1] == "Header" {
22            // Save the previous dataSet, if there is one
23            if dataSetName != "" {
24                dataSets[dataSetName] = dataSet
25            }
26            // Prepare structures for the current dataSet
27            header = r
28            dataSetName = r[0]
29            dataSet = []map[string]string{}
30        }
31        if r[1] == "Data" {
32            d := map[string]string{}
33            for c, h := range header {
34                // Skip the category and Header/Data columns
35                if c < 2 {
36                    continue
37                }
38                // The row may be shorter than the header if trailing
39                // NULL values were omitted
40                if len(r) > c {
41                    d[h] = r[c]
42                }
43            }
44            dataSet = append(dataSet, d)
45        }
46        reader.FieldsPerRecord = 0
47    }
48    // Save the last dataSet, since no further Header row will flush it
49    if dataSetName != "" {
50        dataSets[dataSetName] = dataSet
51    }
52    return dataSets, nil
53}

Undoubtedly this implementation is not the cleanest option, but it’s pretty quick to write, the code is intuitive, and it does what is required.
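To close the loop, here is a hypothetical caller (not part of my program) that exercises readMultiCSV against the first sample of this post, written without the cosmetic spaces after the commas since encoding/csv keeps leading spaces by default. Besides the imports readMultiCSV already needs, it uses fmt, log, and os.

// main is a hypothetical caller for the readMultiCSV function above
// (same package): it writes the sample to a temporary file, parses it,
// and prints the resulting datasets.
func main() {
    sample := "Cat A,Header,Col1,Col2,Col3\n" +
        "Cat A,Data,ValX1,ValX2,ValX3\n" +
        "Cat A,Data,ValY1,ValY2,ValY3\n" +
        "Cat B,Header,ColA,ColB\n" +
        "Cat B,Data,ValWA,ValWB\n" +
        "Cat B,Data,ValZA,ValZB\n"

    // Write the sample to a temporary file so readMultiCSV can open it.
    tmp, err := os.CreateTemp("", "sample-*.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer os.Remove(tmp.Name())
    if _, err := tmp.WriteString(sample); err != nil {
        log.Fatal(err)
    }
    tmp.Close()

    dataSets, err := readMultiCSV(tmp.Name())
    if err != nil {
        log.Fatal(err)
    }
    for name, rows := range dataSets {
        fmt.Println(name)
        for _, row := range rows {
            fmt.Println(row)
        }
    }
    // Expected output (map iteration order is not guaranteed):
    // Cat A
    // map[Col1:ValX1 Col2:ValX2 Col3:ValX3]
    // map[Col1:ValY1 Col2:ValY2 Col3:ValY3]
    // Cat B
    // map[ColA:ValWA ColB:ValWB]
    // map[ColA:ValZA ColB:ValZB]
}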

I hope you will never encounter weirdly formatted CSV files, but if you ever do, I hope this post gives you some ideas on how Go’s CSV Reader can help you accomplish your goal.

CSV, GoLang