Reading CSV data with GAUSS

Introduction

Minimum GAUSS version: 16.0

The GAUSS commands csvReadM and csvReadSA read data into GAUSS from delimited text files. These files assume that the data is separated by a comma; but as we will see, they can also read data that is delimited by tabs, spaces, semi-colons or any character. The difference between these two functions is that csvReadM reads data into a GAUSS matrix, while csvReadSA, reads data into a GAUSS string or string array.

Filename input

csvReadM and csvReadSA both have one required input, the name of the data file. They also both have one output, the data read from the specified file. If you do not provide a path to your file, csvReadM and csvReadSA will look for the file in your current working directory.

For our first example, we will read in some data from an example data file that is packaged with GAUSS. Before we read the data, we will change our working directory to the GAUSSHOME/examples directory. We can perform this action with the following commands:

//Get the name of our GAUSS examples directory
examp_dir = getGAUSSHome() $+ "examples";

//Change the working directory to the GAUSSHOME/examples directory
//Note: The caret operator (^) tells GAUSS to look at the contents
//of 'examp_dir', rather than the literal token, 'examp_dir'
chdir ^examp_dir;

Now that we have set our working directory, we can load and preview some example data:

//Load all rows and all columns of 'housing.csv'
//into the GAUSS matrix 'housing'
housing = csvReadM("housing.csv");

//Print the first 5 rows to the GAUSS program input/output window
print housing[1:5,.];

The output from the code above should look like this:

      .          .          .          .          .          .
3104.00       4.00       2.00       0.00     279.90    2048.00
1173.00       2.00       1.00       0.00     146.50     912.00
3076.00       4.00       2.00       0.00     237.70    1654.00
1608.00       3.00       2.00       0.00     200.00    2068.00

The first thing we notice is that the first row consists entirely of dots '.'. This is because the first row of housing.csv contains the header. Since we specified that the data should be read in as a matrix, the text data was converted to a missing value. Let us take a look at all of the inputs to csvReadM and csvReadSA and then work through some more simple examples.

csvReadM and csvReadSA input options

filename: string, the name of the data file to read from
row_range: Optional input. Scalar or 2x1 matrix, the rows to read from the data file. If row_range is a scalar, then data will be read from row row_range to the end of the file.
col_range: Optional input. Scalar or 2x1 matrix, the columns to read from the data file. If col_range is a scalar, then data will be read from column col_range to the end of the file.
delimiter: Optional input. String, the character used to separate data elements. Default = ",".

Examples

Example 1: Specify a starting row

Continuing with our example from above, if we wanted to load the numeric housing data into a matrix without the header row, we simply add the optional row_range argument like this:

//Load rows 2 to the end and all columns of 'housing.csv' into
//into the GAUSS matrix 'housing'
housing = csvReadM("housing.csv", 2);

//Print the first 5 rows to the GAUSS program input/output window
print housing[1:5,.];

This time, our code should produce this output:

3104.00       4.00       2.00       0.00     279.90    2048.00
1173.00       2.00       1.00       0.00     146.50     912.00
3076.00       4.00       2.00       0.00     237.70    1654.00
1608.00       3.00       2.00       0.00     200.00    2068.00
1454.00       3.00       3.00       0.00     159.90    1477.00

Example 2: Read in string data and specify a single row

In our last example, we read in just the numeric data from housing.csv. If we want to load the header names as a string array, we can do this:

//First and last row to read is the first row
row_range = { 1, 1 };

//Read header names into a string array, named 'var_names'
var_names = csvReadSA("housing.csv", row_range);

//Print the header
print var_names;

This code should return:

taxes     beds    baths      new    price     size

Example 3: Read numeric data from a row range

//First and last row to read
row_range = { 3, 5 };

//Read all columns of row 3 through 5
housing = csvReadM("housing.csv", row_range);

//Print all data read in
print housing;

The code above should return the following output:

1173.00       2.00       1.00       0.00     146.50     912.00
3076.00       4.00       2.00       0.00     237.70    1654.00
1608.00       3.00       2.00       0.00     200.00    2068.00

Example 4: Read rows 2 to the end from a range of columns

//Start from the second row
row_range = 2;

//Read columns 2 through 4
col_range = { 2, 4 };

//Read all columns of row 3 through 5
housing = csvReadM("housing.csv", row_range, col_range);

//Print the first 5 rows of the data read in
print housing[1:5,.];

The code above should return the following output:

4.00        2.00        0.00
2.00        1.00        0.00
4.00        2.00        0.00
3.00        2.00        0.00
3.00        3.00        0.00

Example 5: Read from a file with a different delimiter

So far, all of our examples have assumed that the elements in our data file are separated by a comma. GAUSS does not come with any space or tab-separated files. But you can read them in by specifying the fourth argument as a string, containing the separator used in the file. For example:

//Tab separator
sep = "\t";

//Read all rows and all columns of a tab-separated file
x = csvReadM("my_tab_data.txt", 1, 1, sep);

//Space separator
sep = " ";

//Read all rows and all columns of a space-separated file
x = csvReadM("my_space_data.txt", 1, 1, sep);

Troubleshooting

The most common reason that csvReadM or csvReadSA will fail with an error is if the file does not exist, or is not in the directory that you specify. The GAUSS function filesa will return a string array of files that match a file specification. This can be helpful to double-check the existence of a file. For example, the code:

file_name = "housing.csv";

print filesa(file_name);

will return:

housing.csv

if housing.csv exists in our current working directory. If the file does not exist, the filesa command above will return an empty string. For example, if we misspell the filename for one of the examples above, like this:

//Misspell file name
file_name = "hhousing.csv";

//Attempt to read data from file that does not exist
x = csvReadM(file_name);

GAUSS will return an error. We can check for the existence of the file with filesa, like this:

//Print 'hhousing.csv' if that file is found
print filesa(file_name);

Since that file does not exist, we should get an empty string as the only output to the command above. Since filesa accepts the star '*' character for wildcards, we can use filesa to print out all files in our current working directory that end with .csv like this:

//Print all files in the current working directory
//that end with .csv
print filesa("*.csv");

If our current working directory is GAUSSHOME/examples, the output from this command should be:

beef_prices.csv
binary.csv
housing.csv

If we compare this output to the file_name variable, we should see that we need to change hhousing.csv to housing.csv.