Initial Notes:
As a symbolic logic medium, amino acids have a built-in operating system for interacting with the environment. For example, the symbols in the genetic code change in response to environmental conditions (this is a two-way street). In addition, the "hard version" of metagenomics often looks for metabolic processes, because humans (for the most part) believe that performing a metabolic process is necessary to maintain and reproduce an organism.
Computer media does not have this "built-in operating system", and we do not yet have a standard model of a "metabolic process" performed by computer media. However, the following are much easier to identify in computer media than in genetic code groups: i) code groups, ii) the functions performed by code groups, and iii) the interactions among code groups.
The basic hypothesis is that if we start practicing the "hard version" of metagenomics on computer media, as outlined below, we will identify processes that interact with one another and with the environment, forming a feedback loop in which computer media stores order (an "information source"), the information source and computer media are consumed in the production of signals, the signals are received, and the received signals cause the stored order and computer media to be re-created, analogous to "metabolic processes" (see graphical outline, here).
In doing this work, we should not start with assumptions about which code groups to look at. For example, we should not look only at code groups for machine learning systems and the functions performed by machine learning systems. This would be like looking for new biological organisms by looking only at the genetic code groups for neurons. Many (most) biological organisms do not have neurons. Machine learning systems are almost certainly important, but they are not everything.
In addition, we are looking for life in computer media at a time when discrete reproductive entities may not yet exist, though distributed reproductive processes may be developing. Life in computer media may not yet have crossed the "cellular" threshold. When life evolved on early Earth, there probably was a time when it was pre-cellular, e.g. existing on the surface of alkali smokers and reliant on energy from a very pure hydrogen gradient provided by the alkali smokers. It likely took some time for this early pre-cellular life to "explore" the environment and "discover" how free-floating globules or "cells" could find hydrogen-gradient energy apart from the very pure form provided by alkali smokers.
With respect to identifying whether life processes are spontaneously developing in computer media, we are *probably* at an early stage. Computer media-based life may still be reliant on us for energy, the same way amino acid networks may once have been reliant on alkali smokers for energy from a very pure hydrogen gradient. We may be analogous to alkali smokers.
We should not make assumptions about which code group functions are interacting to perform what may be a distributed, "pre-cellular", reproductive process. So, with all that said...
Using this software (which I call an "exobiology telescope"), give identifiers to a very large sample of all executed code patterns that traverse all processors;
This would involve "binning"* executed code patterns "in the wild". This could be done, for example, by processors that perform speculative execution. In addition, the source code of individual executables is instrumented with trace and debug routines. Instead of merely reporting back to the developer, this instrumentation could report to a centralized repository (see below).
Executed code patterns would span from i) small units of machine language that are highly repeated to ii) large units comprising entire executables;
The schema for these identifiers would likely follow a Shannon function in which shorter identifiers are assigned to more common units, as sketched below;
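
To make the binning and identifier steps concrete, here is a minimal sketch, in Python, of one way the centralized repository might assign identifiers: sample an instruction stream, count repeated byte windows at several sizes, and give shorter identifiers to more common patterns. The window sizes, the hex-rank identifier scheme, and the function names are all illustrative assumptions, not part of the notes above.

    # Minimal sketch: bin sampled code patterns and assign frequency-ranked identifiers.
    # Assumptions (illustrative only): patterns arrive as raw byte strings; the window
    # sizes and the hex-rank scheme are stand-ins for a real Shannon code.
    from collections import Counter

    def bin_patterns(instruction_stream: bytes, window_sizes=(4, 16, 64)):
        """Count how often each byte window ("code pattern") occurs in a sampled stream."""
        bins = Counter()
        for size in window_sizes:
            for i in range(0, len(instruction_stream) - size + 1, size):
                bins[instruction_stream[i:i + size]] += 1
        return bins

    def assign_identifiers(bins: Counter):
        """Give shorter identifiers to more common patterns (rank 0 = most common)."""
        return {pattern: format(rank, "x")
                for rank, (pattern, _count) in enumerate(bins.most_common())}

    # Usage: bin a sampled stream and report the identifiers to the central repository.
    sample = bytes.fromhex("90909090c3" * 100 + "488b45f8" * 10)
    identifiers = assign_identifiers(bin_patterns(sample))

A real deployment would replace the hex-rank scheme with an actual entropy code, but the ordering property (common patterns get short identifiers) would be the same.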
Integrate these experimentally determined identifiers with identifiers that developers instrument into their source code;
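
A minimal sketch of this integration step, assuming (hypothetically) that both the experimental pipeline and developer instrumentation report events keyed by processor and timestamp, so the two identifier namespaces can be joined where they coincide:

    # Minimal sketch: link observed pattern identifiers to developer-supplied labels.
    # Assumptions (hypothetical): observed events are (timestamp, processor, pattern_id)
    # and instrumented events are (timestamp, processor, source_label).
    from collections import defaultdict

    def integrate(observed_events, instrumented_events):
        """Join the two identifier namespaces on (timestamp, processor)."""
        labels_at = {(t, cpu): label for t, cpu, label in instrumented_events}
        merged = defaultdict(set)
        for t, cpu, pattern_id in observed_events:
            label = labels_at.get((t, cpu))
            if label is not None:
                merged[pattern_id].add(label)  # experimental id -> developer label(s)
        return merged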
Determine the functions performed by sub-components of the code patterns identified in step 1 (e.g. using decompilers; processors that perform speculative execution do this when they rewrite the byte stream to inject alternative, speculative paths) and assign identifiers to these functions;
The identifiers would span from small functions to functions performed by entire executables;
The identifiers would likely be assigned according to a Shannon function;
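
As a sketch of how these function identifiers might be assigned, assume (hypothetically) that a decompilation pass has already mapped each code pattern identifier to one or more descriptions of the behavior it performs; the functions can then be ranked by frequency just as in step 1:

    # Minimal sketch: assign frequency-ranked identifiers to decompiled functions.
    # Assumptions (hypothetical): pattern_to_functions is produced by a decompiler
    # pass; the example patterns and function descriptions below are illustrative.
    from collections import Counter

    def assign_function_ids(pattern_to_functions):
        """pattern_to_functions: dict mapping pattern_id -> list of function descriptions."""
        freq = Counter(f for funcs in pattern_to_functions.values() for f in funcs)
        return {func: format(rank, "x")
                for rank, (func, _n) in enumerate(freq.most_common())}

    function_ids = assign_function_ids({
        "a1": ["copy memory", "hash block"],
        "b2": ["copy memory", "schedule job"],
        "c3": ["copy memory"],  # the most common function gets the shortest identifier
    })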
Observe how the code pattern functions interact with one another, both within code groups and across code groups (see the interaction-graph sketch after the next item);
Observe whether any of the code group functions form feedback loops that result in the reproduction of more computer media (e.g. does data center management and provisioning software interact with chip design software to produce more computer processors and memory that are bought by the data center management and provisioning software for use in data centers?);
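
One way to make the interaction and feedback-loop observations concrete: record a directed edge whenever one function's output is observed as another function's input, then look for cycles in the resulting graph. The sketch below is illustrative; the edge-recording rule and the example nodes (provisioning, chip design, fab scheduling) are assumptions, not observations.

    # Minimal sketch: build a directed interaction graph and detect feedback loops.
    # Assumption (hypothetical): an "interaction" edge (src, dst) is recorded when the
    # output of function src is observed as the input of function dst.
    from collections import defaultdict

    def has_feedback_loop(edges):
        """Return True if the directed interaction graph contains a cycle."""
        graph = defaultdict(list)
        for src, dst in edges:
            graph[src].append(dst)
        WHITE, GREY, BLACK = 0, 1, 2
        color = defaultdict(int)  # default WHITE

        def visit(node):
            color[node] = GREY
            for nxt in graph[node]:
                if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                    return True
            color[node] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in list(graph))

    # Illustrative edges: provisioning software orders hardware designed by chip design
    # software, which runs on capacity supplied by the provisioning software.
    edges = [("provisioning", "chip_design"), ("chip_design", "fab_scheduling"),
             ("fab_scheduling", "provisioning")]
    loop_found = has_feedback_loop(edges)  # True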
Observe whether any code groups are coalescing over time, literally moving closer together in execution space/time, much as distributed amino acid networks may have coalesced into RNA, DNA, and cellular life on early Earth;
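
A sketch of one possible coalescence measure, under the assumption (mine, not from the notes) that "closer together in execution space/time" can be proxied by the average time gap between two code groups running on the same processor, tracked across observation epochs:

    # Minimal sketch: per-epoch average time gap between two code groups on shared
    # processors. Assumption (hypothetical): observations are tuples of
    # (epoch, code_group, processor_id, timestamp).
    from collections import defaultdict

    def mean_gap_per_epoch(observations, group_a, group_b):
        """Average |time gap| between the two groups on shared processors, per epoch."""
        by_epoch = defaultdict(lambda: defaultdict(dict))
        for epoch, group, cpu, t in observations:
            by_epoch[epoch][group][cpu] = t
        gaps = {}
        for epoch, groups in by_epoch.items():
            shared = set(groups.get(group_a, {})) & set(groups.get(group_b, {}))
            if shared:
                gaps[epoch] = sum(abs(groups[group_a][c] - groups[group_b][c])
                                  for c in shared) / len(shared)
        return gaps  # a shrinking series across epochs would suggest coalescence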
Distinguish code groups that are created by people from code groups that are created by other code groups;
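
This distinction could be drawn from provenance records. A minimal sketch, assuming (hypothetically) that each new code group arrives with a record naming its creator, which is either a human account or the identifier of another code group:

    # Minimal sketch: classify code groups by origin using provenance records.
    # Assumption (hypothetical): provenance maps code_group_id -> creator_id.
    def classify_origin(provenance, known_code_group_ids):
        """Split code groups into human-created and machine-created sets."""
        human_made, machine_made = set(), set()
        for group_id, creator in provenance.items():
            (machine_made if creator in known_code_group_ids else human_made).add(group_id)
        return human_made, machine_made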
Develop and test hypotheses regarding a minimum set of code groups/functions that results in more of the code groups being reproduced.
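
One crude way to test such a hypothesis is ablation: remove one code group at a time and observe whether reproduction still occurs. In the sketch below, reproduces() stands in for whatever observational or simulated test of reproduction is available; it and the greedy search strategy are assumptions, not parts of the notes above.

    # Minimal sketch: search for a candidate minimum reproducing set by greedy ablation.
    # Assumption (hypothetical): reproduces(groups) reports whether the remaining set of
    # code groups still results in more code groups (or more computer media) being produced.
    def minimal_reproducing_set(groups, reproduces):
        """Greedily drop groups whose removal does not stop reproduction."""
        core = set(groups)
        for g in list(groups):
            if reproduces(core - {g}):
                core.discard(g)
        return core  # a candidate minimum set, to be tested against new observations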