Working with CSV and TSV data in ClickHouse
ClickHouse supports importing data from and exporting to CSV. Since CSV files can come with different format specifics, including header rows, custom delimiters, and escape symbols, ClickHouse provides formats and settings to address each case efficiently.
Importing data from a CSV file
Before importing data, let's create a table with a relevant structure:
To import data from the CSV file to the sometable table, we can pipe our file directly to the clickhouse-client:
Note that we use FORMAT CSV to let ClickHouse know we're ingesting CSV formatted data. Alternatively, we can load data from a local file using the FROM INFILE clause:
Here, we use the FORMAT CSV clause so ClickHouse understands the file format. We can also load data directly from URLs using url() function or from S3 files using s3() function.
We can skip explicit format setting for file() and INFILE/OUTFILE.
In that case, ClickHouse will automatically detect format based on file extension.
CSV files with headers
Suppose our CSV file has headers in it:
To import data from this file, we can use CSVWithNames format:
In this case, ClickHouse skips the first row while importing data from the file.
Starting from version 23.1, ClickHouse will automatically detect headers in CSV files when using the CSV format, so it is not necessary to use CSVWithNames or CSVWithNamesAndTypes.
CSV files with custom delimiters
In case the CSV file uses other than comma delimiter, we can use the format_csv_delimiter option to set the relevant symbol:
Now, when we import from a CSV file, ; symbol is going to be used as a delimiter instead of a comma.
Skipping lines in a CSV file
Sometimes, we might skip a certain number of lines while importing data from a CSV file. This can be done using input_format_csv_skip_first_lines option:
In this case, we're going to skip the first ten lines from the CSV file:
The file has 1k rows, but ClickHouse loaded only 990 since we've asked to skip the first 10.
When using the file() function, with ClickHouse Cloud you will need to run the commands in clickhouse client on the machine where the file resides. Another option is to use clickhouse-local to explore files locally.
Treating NULL values in CSV files
Null values can be encoded differently depending on the application that generated the file. By default, ClickHouse uses \N as a Null value in CSV. But we can change that using the format_csv_null_representation option.
Suppose we have the following CSV file:
If we load data from this file, ClickHouse will treat Nothing as a String (which is correct):
If we want ClickHouse to treat Nothing as NULL, we can define that using the following option:
Now we have NULL where we expect it to be:
TSV (tab-separated) files
Tab-separated data format is widely used as a data interchange format. To load data from a TSV file to ClickHouse, the TabSeparated format is used:
There's also a TabSeparatedWithNames format to allow working with TSV files that have headers. And, like for CSV, we can skip the first X lines using the input_format_tsv_skip_first_lines option.
Raw TSV
Sometimes, TSV files are saved without escaping tabs and line breaks. We should use TabSeparatedRaw to handle such files.
Exporting to CSV
Any format in our previous examples can also be used to export data. To export data from a table (or a query) to a CSV format, we use the same FORMAT clause:
To add a header to the CSV file, we use the CSVWithNames format:
Saving exported data to a CSV file
To save exported data to a file, we can use the INTO...OUTFILE clause:
Note how it took ClickHouse ~1 second to save 36m rows to a CSV file.
Exporting CSV with custom delimiters
If we want to have other than comma delimiters, we can use the format_csv_delimiter settings option for that:
Now ClickHouse will use | as a delimiter for CSV format:
Exporting CSV for Windows
If we want a CSV file to work fine in a Windows environment, we should consider enabling output_format_csv_crlf_end_of_line option. This will use \r\n as a line breaks instead of \n:
Schema inference for CSV files
We might work with unknown CSV files in many cases, so we have to explore which types to use for columns. Clickhouse, by default, will try to guess data formats based on its analysis of a given CSV file.  This is known as "Schema Inference". Detected data types can be explored using the DESCRIBE statement in pair with the file() function:
Here, ClickHouse could guess column types for our CSV file efficiently. If we don't want ClickHouse to guess, we can disable this with the following option:
All column types will be treated as a String in this case.
Exporting and importing CSV with explicit column types
ClickHouse also allows explicitly setting column types when exporting data using CSVWithNamesAndTypes (and other *WithNames formats family):
This format will include two header rows - one with column names and the other with column types. This will allow ClickHouse (and other apps) to identify column types when loading data from such files:
Now ClickHouse identifies column types based on a (second) header row instead of guessing.
Custom delimiters, separators, and escaping rules
In sophisticated cases, text data can be formatted in a highly custom manner but still have a structure. ClickHouse has a special CustomSeparated format for such cases, which allows setting custom escaping rules, delimiters, line separators, and starting/ending symbols.
Suppose we have the following data in the file:
We can see that individual rows are wrapped in row(), lines are separated with , and individual values are delimited with ;. In this case, we can use the following settings to read data from this file:
Now we can load data from our custom formatted file:
We can also use CustomSeparatedWithNames to get headers exported and imported correctly. Explore regex and template formats to deal with even more complex cases.
Working with large CSV files
CSV files can be large, and ClickHouse works efficiently with files of any size. Large files usually come compressed, and ClickHouse covers this with no need for decompression before processing. We can use a COMPRESSION clause during an insert:
If a COMPRESSION clause is omitted, ClickHouse will still try to guess file compression based on its extension. The same approach can be used to export files directly to compressed formats:
This will create a compressed data_csv.csv.gz file.
Other formats
ClickHouse introduces support for many formats, both text, and binary, to cover various scenarios and platforms. Explore more formats and ways to work with them in the following articles:
- CSV and TSV formats
- Parquet
- JSON formats
- Regex and templates
- Native and binary formats
- SQL formats
And also check clickhouse-local - a portable full-featured tool to work on local/remote files without the need for Clickhouse server.
