Scientists rename human genes since it is easier than teaching Microsoft Excel to read them right

Scientists have had to rename human genes because Microsoft Excel keeps reading gene symbols as dates.

Each of the tens of thousands of genes in the human genome has a unique name and an alphanumeric code, known as a symbol, that scientists use to coordinate their research. While all was good in the DNA-RNA space for a while, over the past year or so some 27 human genes have been renamed because Microsoft Excel kept reading their symbols as dates.

Now, this issue is not as surprising as one might think. Scientists regularly use Excel to keep track of their work, but the default settings of the spreadsheet software are designed for more mundane applications - like actually tracking dates. So, when a scientist types a gene’s alphanumeric symbol into a spreadsheet, for example MARCH1, which is short for “Membrane Associated Ring-CH-Type Finger 1”, Excel converts it into the date 1-Mar.
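To make the problem concrete, here is a rough Python sketch of how a list of gene symbols could be screened for ones that resemble dates. It does not reproduce Excel's actual parsing rules, which are not described here. It assumes, purely for illustration, that a symbol is at risk when its letters are the start of an English month name (at least three letters) and its digits form a plausible day of the month. The function name `looks_like_date` is hypothetical.

```python
import re

# Full English month names; a symbol's letters are compared against these.
MONTHS = [
    "JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE",
    "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER",
]

# Three to nine letters, an optional hyphen, then one or two digits.
SYMBOL_PATTERN = re.compile(r"^([A-Za-z]{3,9})-?(\d{1,2})$")


def looks_like_date(symbol: str) -> bool:
    """Heuristic only: flag symbols such as MARCH1 or SEPT1.

    Assumption (not Excel's documented behaviour): the letters must be a
    prefix of a month name and the number must be between 1 and 31.
    """
    match = SYMBOL_PATTERN.match(symbol.strip())
    if match is None:
        return False
    letters = match.group(1).upper()
    day = int(match.group(2))
    is_month_prefix = any(month.startswith(letters) for month in MONTHS)
    return is_month_prefix and 1 <= day <= 31


if __name__ == "__main__":
    for gene in ["MARCH1", "SEPT1", "MARCHF1", "SEPTIN1", "BRCA1"]:
        print(gene, looks_like_date(gene))
```

A check like this can only warn about risky symbols before a file is opened in a spreadsheet; it cannot repair values that have already been converted.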

This is not just frustrating for the people on the job but also dangerous, because it corrupts data that scientists then have to sort through and restore manually. The error is also widespread and affects peer-reviewed scientific work.

A study titled ‘Gene name errors are widespread in the scientific literature’ examined genetic data shared alongside 3,597 published papers and found that about one-fifth had been affected by Excel errors.

There’s no easy way to fix this since Excel does not give you an option to turn off auto-formatting. The only way to avoid it is to change the data type for each individual column. Even this workaround is fragile. While a scientist can fix their own errors, as soon as another person opens the same spreadsheet in Excel, the errors will be introduced all over again.

The only fix that scientists have found to be handy has come from the scientific body in charge of standardising gene names - the HUGO Gene Nomenclature Committee, or HGNC. This week the HGNC published new guidelines for gene naming that include guidance on symbols that affect data handling and retrieval.

From now on, HGNC says, human genes and the proteins they express will be named with Excel’s auto-formatting in mind. So MARCH1 becomes MARCHF1, SEPT1 becomes SEPTIN1, and so on. HGNC will keep a record of old symbols and names to avoid confusion.
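For researchers with older datasets, the renaming amounts to a lookup from previous symbols to current ones. The Python sketch below shows one way such a lookup could work. It includes only the two renames mentioned above and is not HGNC's own tooling or data format. The names `RENAMED`, `current_symbol` and `previous_symbol` are hypothetical.

```python
# Only the two renames named in this article; a real table would be
# taken from HGNC's published records, not hard-coded like this.
RENAMED = {
    "MARCH1": "MARCHF1",
    "SEPT1": "SEPTIN1",
}

# Reverse lookup, so an old symbol can still be traced from a new one.
PREVIOUS = {new: old for old, new in RENAMED.items()}


def current_symbol(symbol: str) -> str:
    """Return the updated symbol, or the input unchanged if not renamed."""
    return RENAMED.get(symbol.upper(), symbol)


def previous_symbol(symbol: str):
    """Return the old symbol for a renamed gene, or None if there is none."""
    return PREVIOUS.get(symbol.upper())


if __name__ == "__main__":
    old_list = ["MARCH1", "SEPT1", "BRCA1"]
    print([current_symbol(s) for s in old_list])
    print(previous_symbol("SEPTIN1"))
```

Keeping the reverse lookup mirrors the point about retaining old symbols: papers published under the earlier names remain traceable.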

Elspeth Bruford, the HGNC coordinator, told The Verge that the names of 27 genes have been changed over the past year but the guidelines have been formally announced only this week.

Bruford said that they had consulted research communities to discuss the proposed updates and, when the changes were being put into effect, had also notified researchers who had published specifically on these genes.

The art of naming genes, as Bruford explained, is driven greatly by consensus. HGNC has to be aware of the individual needs of the people who will be most affected by their work - just like lexicographers who update dictionaries.

HGNC’s focus was on practical concerns like minimising confusion. And for that reason they had to ensure that gene symbols are unique and gene names are brief and specific. These names cannot use subscript or superscript and can only contain Latin letters and Arabic numerals and should not spell out names or words, especially offensive ones.
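Two of these rules, the restriction on characters and the requirement that symbols be unique, are simple enough to check mechanically. The Python sketch below does so under the rules as stated in this article only; the full HGNC guidelines may allow exceptions that it does not cover. The function names are hypothetical.

```python
import re
from collections import Counter

# Latin letters and Arabic numerals only, per the rule as described here.
# The explicit ASCII ranges exclude subscript and superscript characters.
ALLOWED = re.compile(r"[A-Za-z0-9]+")


def has_allowed_characters(symbol: str) -> bool:
    """True if the whole symbol is made of plain letters and digits."""
    return ALLOWED.fullmatch(symbol) is not None


def duplicated_symbols(symbols):
    """Return symbols that appear more than once, ignoring letter case."""
    counts = Counter(s.upper() for s in symbols)
    return sorted(s for s, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(has_allowed_characters("SEPTIN1"))
    print(has_allowed_characters("CO\u2082"))  # contains a subscript 2
    print(duplicated_symbols(["CARS1", "WARS1", "cars1"]))
```

Rules such as avoiding offensive words or keeping names brief and specific depend on human judgement and are not attempted here.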

This decision to rename genes is not unusual, though, Bruford said. Many gene symbols that can be read as nouns have been renamed in the past to avoid false positives during searches. For example, CARS has been changed to CARS1, WARS to WARS1 and MARS to MARS1. Some others were changed to avoid insult.

Citing a case where a clinician has to explain to a parent that the child has a mutation in a particular gene, Bruford gave the example of the gene name ‘headcase homolog (Drosophila)’, which was named after the equivalent gene in the fruit fly. This was changed to ‘hdc homolog, cell cycle regulator’ to avoid offense.

But this is the first time the guidelines have been rewritten to specifically counter software problems. And the scientists are thrilled.

While things seem to be relatively peaceful in the scientific community, questions have been raised about the decision to rename genes. The biggest question seems to be - Why is it easier to rename human genes than it is to change how Excel works?

In a fight between Microsoft and the genetics community, why are the scientists backing down?

While Microsoft did not comment upon this, Bruford’s theory is that it’s not “worth the trouble to change” since this is quite a limited use case of Microsoft Excel. There’s very little incentive for Microsoft to make a significant change, for the sake of just one community, to a feature in software that is so widely used.

Whether Microsoft incorporates this change or not remains to be seen, but it doesn’t make sense for scientists to wait around for Excel to get a fix. Scientists have already found a long-term solution. A spreadsheet is fleeting, as is software - but these genes are going to be around, so they might as well have names that work as names, and not dates.