From 2ed06e0ce2b167ef230783238e1e1098e9bc0381 Mon Sep 17 00:00:00 2001 From: Charles Plessy Date: Sun, 21 Dec 2014 13:43:15 +0900 Subject: [PATCH] Premiers pas en Haskell. --- Haskell/refSeqIdSymbol.html | 143 ++++++++++++++++++++ Haskell/refSeqIdSymbol.lhs | 263 ++++++++++++++++++++++++++++++++++++ 2 files changed, 406 insertions(+) create mode 100644 Haskell/refSeqIdSymbol.html create mode 100644 Haskell/refSeqIdSymbol.lhs diff --git a/Haskell/refSeqIdSymbol.html b/Haskell/refSeqIdSymbol.html new file mode 100644 index 00000000..17897f44 --- /dev/null +++ b/Haskell/refSeqIdSymbol.html @@ -0,0 +1,143 @@ +

A RefSeq parser that outputs the gene symbol for each ID

+

The goal

+

RefSeq is a database of reference biological (DNA, RNA or protein) seqences made by the National Center for Biological Information (NCBI, USA). In a project at work we wanted to list for each entry in the RNA database the unique version number of the entry and its gene symbol (a short human-readable name).

+

This looked like an excellent excuse to practice Haskell... this is the first program I wrote in that language. I did not yet manage to make it run fast enough compared to a minimalistic strategy using shell commands.

+

Here I describe the format that I want to parse, the Haskell program I wrote, and the quick-and-dirty processing with shell commands.

+

The GenBank format

+

Human sequences can be downloaded from NCBI's FTP site, in a structured format called GenBank. RefSeq entries typically start like the following one.

+
LOCUS       NM_131041               1399 bp    mRNA    linear   VRT 28-SEP-2014
+DEFINITION  Danio rerio neurogenin 1 (neurog1), mRNA.
+ACCESSION   NM_131041
+VERSION     NM_131041.1  GI:18859080
+KEYWORDS    RefSeq.
+SOURCE      Danio rerio (zebrafish)
+  ORGANISM  Danio rerio
+            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
+            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
+            Cypriniformes; Cyprinidae; Danio.
+

The record is made of fields, which start with their name in upper case at the beginning of a line. Similarly to e-mail headers, a field in the GenBank format includes the next line if this line starts with a space. Fields can be nested by indentation, like for the ORGANISM field above.

+

In the VERSION field, the version number is the first element (NM_131041.1 in the example above).

+

The gene symbol is contained in the FEATURE field. In the record NM_131041.1, it starts like the following.

+
FEATURES             Location/Qualifiers
+     source          1..1399
+                     /organism="Danio rerio"
+                     /mol_type="mRNA"
+                     /db_xref="taxon:7955"
+                     /chromosome="14"
+                     /map="14"
+     gene            1..1399
+                     /gene="neurog1"
+                     /gene_synonym="cb260; chunp6899; neurod3; ngn1; ngr1;
+                     zNgn1"
+                     /note="neurogenin 1"
+                     /db_xref="GeneID:30239"
+                     /db_xref="ZFIN:ZDB-GENE-990415-174"
+

The FEATURES field also contains indented sub-fields, but their name is not restricted to upper-case characters. These subfields are structured: their value starts with sequence coordinates, followed by a list of keys and values, where each pair of keys and values is separated by spaces and a slash '/'.

+

Lastly, GenBank records are terminated by a line that contains exactly two slashes (//) and nothing else.

+

The program

+

This whole file is written using literate Haskell. It can actually be compiled! In literate Haskell, everything is comment by default and the code is prefixed by '> '.

+
import Text.Parsec
+import Text.Parsec.String
+import Data.List (intercalate)
+import Data.List.Split (splitOn)
+

I used the Parsec package for parsing, as well as two helper functions from packages related to list handling.

+
main = do
+  r <- getContents
+  let rs = splitOn "//\n" r
+  mapM_ parseGbRecord rs
+

It does not seem possible to parse a stream of text: the whole parsing must be finished before a result is returned, which was not feasable with hundreds of megabytes of data. Therefore, in the main loop getting the contents from stdin, I split the records one after the other, and run the parser on each of them.

+
parseGbRecord :: String -> IO ()
+parseGbRecord r = case parse gbRecord "(stdin)" r of
+            Left e  -> do putStrLn "Error parsing input:"
+                          print e
+            Right r ->  putStrLn r
+

Parsec returns either an error message or the result of the parsing.

+

TODO: how about printing always ? What is the difference between pring and putStrLn ?

+
gbRecord = do
+  fs <- many field
+  return . intercalate "\t" $ filter (/= "") fs
+

A GenBank record is made of many fields. For each field the parser reports a string, so the result will be a list of strings. Some strings will be empty and discarded. The resulting list is collapsed in a tab-separated string.

+
field = do
+  f <- fieldName
+  many1 space
+  case f of
+    "VERSION"  -> getVersionNumber
+    "FEATURES" -> getGeneSymbol
+    _          -> endField >> return ""
+

A field starts with a field name, which is recorded. It is followed with multiple spaces. Since I am only intersted in version and gene symbol, if the field name was VERSION or FEATURES, more information is extracted, otherwise an empty string is returned.

+
fieldName = many1 upper
+endField  = manyTill anyChar (try separator <|> try eof)
+separator = newline >> notFollowedBy (char ' ')
+

A field name is upper case. Fields continue with any character until a separator or the end of the file is reached. A separator is a newline character not followed by a space.

+
getVersionNumber = do
+  let versionNumberChar = oneOf $ "NXRM_" ++ ['0'..'9'] ++ "."
+  v <- many versionNumberChar
+  endField
+  return v
+

A version number is a string containing the letters N, X, R or M, underscores, digits and dots. The precise definition is actually stricter: the version numbers start by NM_, NR_, XM_ or XR_, but I was worried of the performance hit if testing for this precisely.

+

Once the version number is read, it is recorded, and the rest of the field is discarded.

+
getGeneSymbol = do
+  let geneSymbolChar = oneOf $ ['A'..'Z'] ++ "orf" ++ "p" ++ "-" ++ ['0'..'9']
+  manyTill anyChar (try (string "/gene=\""))
+  g <- many geneSymbolChar
+  char '"'
+  endField
+  return g
+

Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or p, dashes and digits. It is recorded after finding the string /gene=". The parser then checks that the closing double quote is present, and discards it together with the rest of the field.

+

Quick-and-dirty parsing with Unix tools

+

With the following command, I could produce a list of version numbers and gene symbols in a minute or so. However, what it does is not very obvious, nor flexible as it does not follow the format's syntax, and is probably very fragile.

+
zcat human.*.rna.gbff.gz |
+  grep -e VERSION -e gene= |
+  uniq |
+  sed 's/=/ /' |
+  awk '{print $2}' |
+  tr '\n' '\t' |
+  sed -e 's/"\t/\n/g' -e 's/"//g' > human.rna.ID-Symbol.txt
+

In brief, it:

+ +

Speed

+

Unfortunately, the Haskell version is way too slow to process the full RefSeq data. Here is a comparison using a test file of only 476 Kib.

+
$ ghc -O2 refSeqIdSymbol.lhs
+[1 of 1] Compiling Main             ( refSeqIdSymbol.lhs, refSeqIdSymbol.o )
+Linking refSeqIdSymbol ...
+chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb 
+NM_001142483.1  NREP
+NM_001142481.1  NREP
+NM_001142480.1  NREP
+NM_001142477.1  NREP
+NM_001142475.1  NREP
+NM_001142474.1  NREP
+NM_004772.2 NREP
+NM_001142466.1  GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+
+real    0m0.701s
+user    0m0.664s
+sys 0m0.028s
+
$ time cat hopla.gb | grep -e VERSION -e gene= |   uniq |   sed 's/=/ /' |   awk '{print $2}' |   tr '\n' '\t' |   sed -e 's/"\t/\n/g' -e 's/"//g'
+NM_001142483.1  NREP
+NM_001142481.1  NREP
+NM_001142480.1  NREP
+NM_001142477.1  NREP
+NM_001142475.1  NREP
+NM_001142474.1  NREP
+NM_004772.2 NREP
+NM_001142466.1  GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+real    0m0.015s
+user    0m0.004s
+sys 0m0.004s
diff --git a/Haskell/refSeqIdSymbol.lhs b/Haskell/refSeqIdSymbol.lhs new file mode 100644 index 00000000..d4e33f9c --- /dev/null +++ b/Haskell/refSeqIdSymbol.lhs @@ -0,0 +1,263 @@ +A RefSeq parser that outputs the gene symbol for each ID +======================================================== + +The goal +-------- + +[RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) is a database of _ref_erence +biological (DNA, RNA or protein) _seq_ences made by the National Center for +Biological Information (NCBI, USA). In a project at work we wanted to list for +each entry in the RNA database the unique version number of the entry and its +gene symbol (a short human-readable name). + +This looked like an excellent excuse to practice [Haskell](https://www.haskell.org)... +this is the first program I wrote in that language. I did not yet manage to make +it run fast enough compared to a minimalistic strategy using shell commands. + +Here I describe the format that I want to parse, the Haskell program I wrote, +and the quick-and-dirty processing with shell commands. + + +The GenBank format +------------------ + +Human sequences can be downloaded from NCBI's +[FTP](ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/) site, in a +structured format called +[GenBank](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html). RefSeq +entries typically start like the following one. + +~~~~ +LOCUS NM_131041 1399 bp mRNA linear VRT 28-SEP-2014 +DEFINITION Danio rerio neurogenin 1 (neurog1), mRNA. +ACCESSION NM_131041 +VERSION NM_131041.1 GI:18859080 +KEYWORDS RefSeq. +SOURCE Danio rerio (zebrafish) + ORGANISM Danio rerio + Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; + Actinopterygii; Neopterygii; Teleostei; Ostariophysi; + Cypriniformes; Cyprinidae; Danio. +~~~~ + +The record is made of fields, which start with their name in upper case at the +beginning of a line. Similarly to [e-mail +headers](https://tools.ietf.org/html/rfc5322#section-2.2.3), a field in the +GenBank format includes the next line if this line starts with a space. Fields +can be nested by indentation, like for the ORGANISM field above. + +In the VERSION field, the version number is the first element (NM_131041.1 in +the example above). + +The gene symbol is contained in the FEATURE field. In the record NM_131041.1, it +starts like the following. + +~~~~ +FEATURES Location/Qualifiers + source 1..1399 + /organism="Danio rerio" + /mol_type="mRNA" + /db_xref="taxon:7955" + /chromosome="14" + /map="14" + gene 1..1399 + /gene="neurog1" + /gene_synonym="cb260; chunp6899; neurod3; ngn1; ngr1; + zNgn1" + /note="neurogenin 1" + /db_xref="GeneID:30239" + /db_xref="ZFIN:ZDB-GENE-990415-174" +~~~~ + +The FEATURES field also contains indented sub-fields, but their name is not +restricted to upper-case characters. These subfields are structured: their +value starts with sequence coordinates, followed by a list of keys and values, +where each pair of keys and values is separated by spaces and a slash '/'. + +Lastly, GenBank records are terminated by a line that contains exactly two +slashes (`//`) and nothing else. + + +The program +----------- + +This whole file is written using [literate +Haskell](https://www.haskell.org/haskellwiki/Literate_programming). It can +actually be compiled! In literate Haskell, everything is comment by default +and the code is prefixed by '> '. + +> import Text.Parsec +> import Text.Parsec.String +> import Data.List (intercalate) +> import Data.List.Split (splitOn) + +I used the [Parsec](http://hackage.haskell.org/package/parsec) package for +parsing, as well as two helper functions from packages related to list +handling. + +> main = do +> r <- getContents +> let rs = splitOn "//\n" r +> mapM_ parseGbRecord rs + +It does not seem possible to parse a stream of text: the whole parsing must be +finished before a result is returned, which was not feasable with hundreds of +megabytes of data. Therefore, in the main loop getting the contents from +_stdin_, I split the records one after the other, and run the parser on each of +them. + +> parseGbRecord :: String -> IO () +> parseGbRecord r = case parse gbRecord "(stdin)" r of +> Left e -> do putStrLn "Error parsing input:" +> print e +> Right r -> putStrLn r + +Parsec returns either an error message or the result of the parsing. + +TODO: how about printing always ? What is the difference between pring and putStrLn ? + +> gbRecord = do +> fs <- many field +> return . intercalate "\t" $ filter (/= "") fs + +A GenBank record is made of many fields. For each field the parser reports a +string, so the result will be a list of strings. Some strings will be empty +and discarded. The resulting list is collapsed in a tab-separated string. + +> field = do +> f <- fieldName +> many1 space +> case f of +> "VERSION" -> getVersionNumber +> "FEATURES" -> getGeneSymbol +> _ -> endField >> return "" + +A field starts with a field name, which is recorded. It is followed with +multiple spaces. Since I am only intersted in version and gene symbol, if the +field name was VERSION or FEATURES, more information is extracted, otherwise +an empty string is returned. + +> fieldName = many1 upper +> endField = manyTill anyChar (try separator <|> try eof) +> separator = newline >> notFollowedBy (char ' ') + +A field name is upper case. Fields continue with any character until a +separator or the end of the file is reached. A separator is a newline +character not followed by a space. + +> getVersionNumber = do +> let versionNumberChar = oneOf $ "NXRM_" ++ ['0'..'9'] ++ "." +> v <- many versionNumberChar +> endField +> return v + +A version number is a string containing the letters N, X, R or M, underscores, +digits and dots. The precise definition is actually stricter: the version +numbers start by NM_, NR_, XM_ or XR_, but I was worried of the performance hit +if testing for this precisely. + +Once the version number is read, it is recorded, and the rest of the field is +discarded. + +> getGeneSymbol = do +> let geneSymbolChar = oneOf $ ['A'..'Z'] ++ "orf" ++ "p" ++ "-" ++ ['0'..'9'] +> manyTill anyChar (try (string "/gene=\"")) +> g <- many geneSymbolChar +> char '"' +> endField +> return g + +Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or p, +dashes and digits. It is recorded after finding the string `/gene="`. The +parser then checks that the closing double quote is present, and discards it +together with the rest of the field. + + +Quick-and-dirty parsing with Unix tools +--------------------------------------- + +With the following command, I could produce a list of version numbers and gene +symbols in a minute or so. However, what it does is not very obvious, nor +flexible as it does not follow the format's syntax, and is probably very +fragile. + +~~~~ +zcat human.*.rna.gbff.gz | + grep -e VERSION -e gene= | + uniq | + sed 's/=/ /' | + awk '{print $2}' | + tr '\n' '\t' | + sed -e 's/"\t/\n/g' -e 's/"//g' > human.rna.ID-Symbol.txt +~~~~ + +In brief, it: + + - filters lines containing `VERSION` or `gene=` and discards the other ones; + + - removes consecutive duplicates (there are multiple features per locus + crosslinking to gene IDs), assuming that only one gene symbol is used per + record; + + - replaces `=` with a space so that the relevant information appears to be on + the second column if one considers the flowing text to be a space-separated + table; + + - keeps only the second column (at that point, the version numbers and gene symbols + alternate on successive lines; + + - replaces newlines by tabulations, so that the data is now a gigantic lines + where tab-separated fields alternate for version numbers and symbols; + + - uses the double quotes around the gene symbol as a landmark to transform the + flowing text into a tab-separated file with the two wanted fields; + + - cleans up the remaining double quotes. + +Speed +----- + +Unfortunately, the Haskell version is way too slow to process the full RefSeq +data. Here is a comparison using a test file of only 476 Kib. + +~~~~~ +$ ghc -O2 refSeqIdSymbol.lhs +[1 of 1] Compiling Main ( refSeqIdSymbol.lhs, refSeqIdSymbol.o ) +Linking refSeqIdSymbol ... +chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb +NM_001142483.1 NREP +NM_001142481.1 NREP +NM_001142480.1 NREP +NM_001142477.1 NREP +NM_001142475.1 NREP +NM_001142474.1 NREP +NM_004772.2 NREP +NM_001142466.1 GPT2 +NM_133443.2 GPT2 +NM_173685.2 NSMCE2 +NM_007058.3 CAPN11 + + +real 0m0.701s +user 0m0.664s +sys 0m0.028s +~~~~~ + +~~~~~ +$ time cat hopla.gb | grep -e VERSION -e gene= | uniq | sed 's/=/ /' | awk '{print $2}' | tr '\n' '\t' | sed -e 's/"\t/\n/g' -e 's/"//g' +NM_001142483.1 NREP +NM_001142481.1 NREP +NM_001142480.1 NREP +NM_001142477.1 NREP +NM_001142475.1 NREP +NM_001142474.1 NREP +NM_004772.2 NREP +NM_001142466.1 GPT2 +NM_133443.2 GPT2 +NM_173685.2 NSMCE2 +NM_007058.3 CAPN11 + +real 0m0.015s +user 0m0.004s +sys 0m0.004s +~~~~~ -- 2.47.3