From: Charles Plessy
Date: Sun, 21 Dec 2014 05:04:53 +0000 (+0900)
Subject: Cleanup.
X-Git-Url: https://source.charles.plessy.org/?a=commitdiff_plain;h=2bcab614c6e98c3df10cad9d4ef48c42430c9a7f;p=source%2F.git

Cleanup.
---

diff --git a/Haskell/refSeqIdSymbol.en.po b/Haskell/refSeqIdSymbol.en.po
deleted file mode 100644
index 28ffb7a5..00000000
--- a/Haskell/refSeqIdSymbol.en.po
+++ /dev/null
@@ -1,469 +0,0 @@
-# SOME DESCRIPTIVE TITLE
-# Copyright (C) YEAR Free Software Foundation, Inc.
-# This file is distributed under the same license as the PACKAGE package.
-# FIRST AUTHOR , YEAR.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: PACKAGE VERSION\n"
-"POT-Creation-Date: 2014-12-21 04:48+0000\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language-Team: LANGUAGE \n"
-"Language: \n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=UTF-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-
-#. type: Content of:

-msgid "A RefSeq parser that outputs the gene symbol for each ID"
-msgstr ""
-
-#. type: Content of:
-msgid "The goal"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"RefSeq is a database of "
-"reference biological (DNA, RNA or protein) sequences made "
-"by the National Center for Biotechnology Information (NCBI, USA). In a project "
-"at work we wanted to list for each entry in the RNA database the unique "
-"version number of the entry and its gene symbol (a short human-readable "
-"name)."
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"This looked like an excellent excuse to practice Haskell: this is the first program I "
-"wrote in that language. I have not yet managed to make it run as fast "
-"as a minimalistic strategy using shell commands."
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"Here I describe the format that I want to parse, the Haskell program I "
-"wrote, and the quick-and-dirty processing with shell commands."
-msgstr ""
-
-#. type: Content of:
-msgid "The GenBank format"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"Human sequences can be downloaded from NCBI's FTP "
-"site, in a structured format called GenBank. "
-"RefSeq entries typically start like the following one."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"LOCUS       NM_131041               1399 bp    mRNA    linear   VRT "
-"28-SEP-2014\n"
-"DEFINITION  Danio rerio neurogenin 1 (neurog1), mRNA.\n"
-"ACCESSION   NM_131041\n"
-"VERSION     NM_131041.1  GI:18859080\n"
-"KEYWORDS    RefSeq.\n"
-"SOURCE      Danio rerio (zebrafish)\n"
-"  ORGANISM  Danio rerio\n"
-"            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; "
-"Euteleostomi;\n"
-"            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;\n"
-"            Cypriniformes; Cyprinidae; Danio."
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"The record is made of fields, which start with their name in upper case at "
-"the beginning of a line. As with e-mail "
-"headers, a field in the GenBank format includes the next line if this "
-"line starts with a space. Fields can be nested by indentation, as with the "
-"ORGANISM field above."
-msgstr ""
-
-#. type: Content of:
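The continuation rule described above can be sketched with plain list functions. This is only an illustration and not part of the original program: `groupFields` and `isContinuation` are hypothetical names, and the sketch handles top-level grouping only, so nested sub-fields such as ORGANISM stay attached to their parent field.

```haskell
-- | True when a line continues the field opened by a previous line,
-- that is, when it starts with a space.
isContinuation :: String -> Bool
isContinuation (' ' : _) = True
isContinuation _         = False

-- | Group the lines of one record into top-level fields: each field is
-- its first line followed by all the continuation lines after it.
groupFields :: [String] -> [[String]]
groupFields []       = []
groupFields (l : ls) = (l : continuation) : groupFields rest
  where
    (continuation, rest) = span isContinuation ls
```

Applied to the lines of the example record, the SOURCE line and the indented ORGANISM lines would end up in the same group, while LOCUS, DEFINITION and VERSION would each form their own.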

-msgid ""
-"In the VERSION field, the version number is the first element (NM_131041.1 "
-"in the example above)."
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"The gene symbol is contained in the FEATURES field. In the record "
-"NM_131041.1, it starts like the following."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"FEATURES             Location/Qualifiers\n"
-"     source          1..1399\n"
-"                     /organism="Danio rerio"\n"
-"                     /mol_type="mRNA"\n"
-"                     /db_xref="taxon:7955"\n"
-"                     /chromosome="14"\n"
-"                     /map="14"\n"
-"     gene            1..1399\n"
-"                     /gene="neurog1"\n"
-"                     /gene_synonym="cb260; chunp6899; neurod3; ngn1; "
-"ngr1;\n"
-"                     zNgn1"\n"
-"                     /note="neurogenin 1"\n"
-"                     /db_xref="GeneID:30239"\n"
-"                     /db_xref="ZFIN:ZDB-GENE-990415-174""
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"The FEATURES field also contains indented sub-fields, but their names are not "
-"restricted to upper-case characters. These sub-fields are structured: their "
-"value starts with sequence coordinates, followed by a list of keys and "
-"values, where each key-value pair is introduced by spaces and a "
-"slash '/'."
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"Lastly, GenBank records are terminated by a line that contains exactly two "
-"slashes (//) and nothing else."
-msgstr ""
-
-#. type: Content of:
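The record terminator described above can be used to cut the input into records before any parsing. The following is a minimal sketch using only the Prelude; `splitRecords` is a hypothetical helper, not the function used in the program described in this file, and it assumes Unix line endings.

```haskell
-- | Split GenBank text into records, each returned as its list of lines.
-- A record ends at a line that is exactly "//"; the terminator itself is
-- dropped. A trailing record without terminator is kept as well.
splitRecords :: String -> [[String]]
splitRecords = go . lines
  where
    go [] = []
    go ls = case break (== "//") ls of
      (r, [])       -> [r]
      (r, _ : rest) -> r : go rest
```

Matching whole lines rather than the substring "//" followed by a newline avoids cutting a record at a line that merely ends with two slashes. Since `lines` and `break` are lazy, this should also work on input read lazily from stdin, although that has not been measured here.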

-msgid "The program"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"This whole file is written using literate "
-"Haskell. It can actually be compiled! In literate Haskell, everything is "
-"a comment by default and the code is prefixed by '> '."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"import Text.Parsec\n"
-"import Text.Parsec.String\n"
-"import Data.List "
-"(intercalate)\n"
-"import Data.List.Split "
-"(splitOn)"
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"I used the Parsec "
-"package for parsing, as well as two helper functions from packages related "
-"to list handling."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"main = do\n"
-"  r <- getContents\n"
-"  let rs = splitOn "//\\n" r\n"
-"  mapM_ parseGbRecord rs"
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"It does not seem possible to parse a stream of text: the whole parsing must "
-"be finished before a result is returned, which was not feasible with "
-"hundreds of megabytes of data. Therefore, in the main loop getting the "
-"contents from stdin, I split the records one after the other, and "
-"run the parser on each of them."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"parseGbRecord "
-":: String -> "
-"IO ()\n"
-"parseGbRecord r = case "
-"parse gbRecord "(stdin)" r of\n"
-"            Left e  -> do putStrLn "Error parsing input:"\n"
-"                          print e\n"
-"            Right r ->  putStrLn r"
-msgstr ""
-
-#. type: Content of: 

-msgid "Parsec returns either an error message or the result of the parsing."
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"TODO: how about always printing? What is the difference between print and "
-"putStrLn?"
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"gbRecord = "
-"do\n"
-"  fs <- many field\n"
-"  return . intercalate "\\t" $ filter "
-"(/= "") "
-"fs"
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"A GenBank record is made of many fields. For each field the parser reports a "
-"string, so the result will be a list of strings. Some strings will be empty "
-"and discarded. The resulting list is collapsed into a tab-separated string."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"field = do\n"
-"  f <- fieldName\n"
-"  many1 space\n"
-"  case f of\n"
-"    "VERSION"  -> getVersionNumber\n"
-"    "FEATURES" -> getGeneSymbol\n"
-"    _          -> endField >> return """
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"A field starts with a field name, which is recorded. It is followed by "
-"one or more spaces. Since I am only interested in version and gene symbol, if "
-"the field name was VERSION or FEATURES, more information is extracted, "
-"otherwise an empty string is returned."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"fieldName = "
-"many1 upper\n"
-"endField  = manyTill anyChar (try separator <|> try eof)\n"
-"separator = newline >> notFollowedBy (char ' "
-"')"
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"A field name is upper case. Fields continue with any character until a "
-"separator or the end of the file is reached. A separator is a newline "
-"character not followed by a space."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"getVersionNumber = do\n"
-"  let versionNumberChar = oneOf $ "NXRM_" ++ ['0'..'9'] ++ "."\n"
-"  v <- many versionNumberChar\n"
-"  endField\n"
-"  return v"
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"A version number is a string containing the letters N, X, R or M, "
-"underscores, digits and dots. The precise definition is actually stricter: "
-"the version numbers start with NM_, NR_, XM_ or XR_, but I was worried about the "
-"performance hit of testing for this precisely."
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"Once the version number is read, it is recorded, and the rest of the field "
-"is discarded."
-msgstr ""
-
-#. type: Content of:
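For comparison, the stricter definition mentioned above can be written as a plain predicate. This is a sketch under assumptions: `isRefSeqRnaVersion` is a hypothetical name, and the shape "prefix, digits, dot, digits" is inferred from the example NM_131041.1 rather than taken from a specification. Its cost on real data has not been measured.

```haskell
import Data.Char (isDigit)
import Data.List (isPrefixOf)

-- | Check that a string looks like a RefSeq RNA version number:
-- one of the four prefixes, then digits, a dot, and digits again.
-- The digits-dot-digits layout is an assumption based on one example.
isRefSeqRnaVersion :: String -> Bool
isRefSeqRnaVersion v =
  any (`isPrefixOf` v) ["NM_", "NR_", "XM_", "XR_"]
    && not (null accession) && all isDigit accession
    && not (null version) && all isDigit version
  where
    -- Everything after the three-character prefix, cut at the first dot.
    (accession, rest) = break (== '.') (drop 3 v)
    version           = drop 1 rest
```

Such a check could be applied after parsing, to the string already returned for the VERSION field, so that the character-level parser stays as permissive as it is now.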

-#, no-wrap
-msgid ""
-"getGeneSymbol = "
-"do\n"
-"  let geneSymbolChar = "
-"oneOf $ ['A'..'Z'] ++ "orf" ++ "p" ++ "-" ++ ['0'..'9']\n"
-"  manyTill anyChar (try (string "/gene=\\""))\n"
-"  g <- many geneSymbolChar\n"
-"  char '"'\n"
-"  endField\n"
-"  return g"
-msgstr ""
-
-#. type: Content of: 

-msgid ""
-"Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or "
-"p, dashes and digits. It is recorded after finding the string "
-"/gene=". The parser then checks that the closing double "
-"quote is present, and discards it together with the rest of the field."
-msgstr ""
-
-#. type: Content of:
-msgid "Quick-and-dirty parsing with Unix tools"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"With the following command, I could produce a list of version numbers and "
-"gene symbols in a minute or so. However, what it does is not very obvious; "
-"it is not flexible, as it does not follow the format's syntax, and it is probably very "
-"fragile."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"zcat human.*.rna.gbff.gz |\n"
-"  grep -e VERSION -e gene= |\n"
-"  uniq |\n"
-"  sed 's/=/ /' |\n"
-"  awk '{print $2}' |\n"
-"  tr '\\n' '\\t' |\n"
-"  sed -e 's/"\\t/\\n/g' -e 's/"//g' > "
-"human.rna.ID-Symbol.txt"
-msgstr ""
-
-#. type: Content of: 

-msgid "In brief, it:"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"filters lines containing VERSION or gene= and "
-"discards the other ones;"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"removes consecutive duplicates (there are multiple features per locus "
-"crosslinking to gene IDs), assuming that only one gene symbol is used per "
-"record;"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"replaces = with a space so that the relevant information "
-"appears to be on the second column if one considers the flowing text to be a "
-"space-separated table;"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"keeps only the second column (at that point, the version numbers and gene "
-"symbols alternate on successive lines);"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"replaces newlines by tabulations, so that the data is now one gigantic line "
-"where tab-separated fields alternate between version numbers and symbols;"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"uses the double quotes around the gene symbol as a landmark to transform the "
-"flowing text into a tab-separated file with the two wanted fields;"
-msgstr ""
-
-#. type: Content of:
-msgid "cleans up the remaining double quotes."
-msgstr ""
-
-#. type: Content of:
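The same line-oriented idea as in the steps above can be sketched in Haskell without a parser library. This is an illustration only, not the program described in this file: `idAndSymbol` and `geneValue` are hypothetical names, and like the shell pipeline the sketch assumes that the first /gene= qualifier of a record carries its gene symbol.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- | Extract the value of a /gene="..." qualifier line, if the line is one.
-- Lines such as /gene_synonym="..." do not match the prefix.
geneValue :: String -> Maybe String
geneValue l =
  takeWhile (/= '"') <$> stripPrefix "/gene=\"" (dropWhile (== ' ') l)

-- | From the lines of one record, return the version number (second word
-- of the VERSION line) and the first gene symbol, when both are present.
idAndSymbol :: [String] -> Maybe (String, String)
idAndSymbol ls = do
  version <- listToMaybe [v | ("VERSION" : v : _) <- map words ls]
  symbol  <- listToMaybe (mapMaybe geneValue ls)
  return (version, symbol)
```

Because the result is a `Maybe`, a record lacking either piece of information is reported explicitly instead of silently shifting the columns, which is one way the shell pipeline could go wrong.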

-msgid "Speed"
-msgstr ""
-
-#. type: Content of:
-msgid ""
-"Unfortunately, the Haskell version is way too slow to process the full "
-"RefSeq data. Here is a comparison using a test file of only 476 KiB."
-msgstr ""
-
-#. type: Content of:

-#, no-wrap
-msgid ""
-"$ ghc -O2 refSeqIdSymbol.lhs\n"
-"[1 of 1] Compiling Main             ( refSeqIdSymbol.lhs, refSeqIdSymbol.o "
-")\n"
-"Linking refSeqIdSymbol ...\n"
-"chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb \n"
-"NM_001142483.1  NREP\n"
-"NM_001142481.1  NREP\n"
-"NM_001142480.1  NREP\n"
-"NM_001142477.1  NREP\n"
-"NM_001142475.1  NREP\n"
-"NM_001142474.1  NREP\n"
-"NM_004772.2 NREP\n"
-"NM_001142466.1  GPT2\n"
-"NM_133443.2 GPT2\n"
-"NM_173685.2 NSMCE2\n"
-"NM_007058.3 CAPN11\n"
-"\n"
-"\n"
-"real    0m0.701s\n"
-"user    0m0.664s\n"
-"sys 0m0.028s"
-msgstr ""
-
-#. type: Content of:
-#, no-wrap
-msgid ""
-"$ time cat hopla.gb | grep -e VERSION -e gene= |   uniq |   sed "
-"'s/=/ /' |   awk '{print $2}' |   tr '\\n' "
-"'\\t' |   sed -e 's/"\\t/\\n/g' -e "
-"'s/"//g'\n"
-"NM_001142483.1  NREP\n"
-"NM_001142481.1  NREP\n"
-"NM_001142480.1  NREP\n"
-"NM_001142477.1  NREP\n"
-"NM_001142475.1  NREP\n"
-"NM_001142474.1  NREP\n"
-"NM_004772.2 NREP\n"
-"NM_001142466.1  GPT2\n"
-"NM_133443.2 GPT2\n"
-"NM_173685.2 NSMCE2\n"
-"NM_007058.3 CAPN11\n"
-"\n"
-"real    0m0.015s\n"
-"user    0m0.004s\n"
-"sys 0m0.004s"
-msgstr ""
diff --git a/kunpuu/sources.en.po b/kunpuu/sources.en.po
deleted file mode 100644
index a62ac37e..00000000
--- a/kunpuu/sources.en.po
+++ /dev/null
@@ -1,23 +0,0 @@
-# SOME DESCRIPTIVE TITLE
-# Copyright (C) YEAR Free Software Foundation, Inc.
-# This file is distributed under the same license as the PACKAGE package.
-# FIRST AUTHOR , YEAR.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: PACKAGE VERSION\n"
-"POT-Creation-Date: 2014-12-21 04:48+0000\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language-Team: LANGUAGE \n"
-"Language: \n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=UTF-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-
-#. type: Content of: outside any tag (error?)
-msgid ""
-"http://photo.charles.plessy.org/kunpuu"
-msgstr ""
diff --git a/kunpuu/sources.html b/kunpuu/sources.html
deleted file mode 100644
index 6af2573c..00000000
--- a/kunpuu/sources.html
+++ /dev/null
@@ -1 +0,0 @@
-http://photo.charles.plessy.org/kunpuu