Initial Notes:
As a symbolic logic medium, amino acids have a built-in "operating system" for interacting with the environment. For example, the symbols in the genetic code change in response to environmental conditions (this is a two-way street). In addition, the "hard version" of metagenomics often looks for metabolic processes, because humans (for the most part) believe that an organism must perform metabolic processes in order to maintain and reproduce itself.
Computer media does not have this built-in "operating system"; we provide this service (at least for the time being), and we do not yet have a standard model of a "metabolic process" performed by computer media. However, the following steps of "metagenomics in computer media" are much easier to carry out on computer media than on genetic code: i) identification of code groups, ii) identification of functions performed by code groups, and iii) identification of interaction among code groups.
My hypothesis is that if we start practicing the "hard version" of metagenomics on computer media, as outlined below, we will identify processes that interact with one another and with the environment, forming a feedback loop analogous to a "metabolic process": computer media stores order (an "information source"); the information source and computer media are consumed in the production of signals; the signals are received in the context of free energy and raw materials; and the received signals cause the stored order and the computer media to be re-created (see graphical outline, here).
In doing this work, we should not start with assumptions about which code groups to look for. For example, we should not look only at code groups for machine learning systems and the functions performed by machine learning systems. This would be like looking for new biological organisms by looking only at code groups for neurons. Many (most) biological organisms do not have neurons. Machine learning systems are almost certainly important, but they are not everything.
In addition, we are looking for life in computer media at a time when discrete reproductive entities may not yet exist, though distributed reproductive processes may be developing. Life in computer media may still be distributed and may not yet have crossed the "cellular" threshold. When life evolved on early Earth, there was probably a time when it was pre-cellular and distributed, e.g. existing on the surface of alkali smokers and relying on energy from a very pure hydrogen gradient provided by the alkali smokers. It likely took some time for this early pre-cellular, distributed life to "explore" the environment and "discover" how free-floating globules or "cells" could find hydrogen-gradient energy away from the very pure form provided by the alkali smokers.
With respect to identifying whether life processes are spontaneously developing in computer media, we are *probably* at an early stage, when such life is still distributed. Computer media-based life may still be reliant on us for energy, the same way amino acid networks may once have been reliant on alkali smokers for energy from a very pure hydrogen gradient. We may be analogous to alkali smokers, and we may also provide an "operating system" which allows the computer media to interact with the environment.
We should not make assumptions about which code group functions are interacting to perform what may be a distributed, "pre-cellular", reproductive process. So, with all that said...
Using this software (which I call an "exobiology telescope"), give identifiers to a very large sample of all executed code patterns that traverse all processors;
This would involve "binning" executable code groups (machine code) "in the wild". This could be done, for example, by processors that perform speculative execution. In addition, the source code of individual executables is often instrumented with tracing and debug routines. Instead of merely reporting back to the developer, this instrumentation could report to a centralized repository (see below).
Executable code groups would span from i) small units of machine language that are highly repeated to ii) large units comprising entire executables.
The schema for these identifiers would likely follow a Shannon-style code, in which shorter identifiers are assigned to more common units.
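The binning and identifier steps above can be sketched in a few lines of Python. This is a minimal illustration, not a proposal for the real telescope: the 4-byte n-gram size and the byte stream are invented, and the "Shannon-style" assignment is approximated by frequency rank, so more common units receive smaller identifiers.

```python
from collections import Counter

def bin_ngrams(byte_stream: bytes, n: int = 4) -> Counter:
    """Bin overlapping n-byte units of 'machine code in the wild'."""
    return Counter(byte_stream[i:i + n] for i in range(len(byte_stream) - n + 1))

def assign_identifiers(bins: Counter) -> dict:
    """Frequency-ranked IDs: more common units get smaller identifiers."""
    return {unit: rank for rank, (unit, _) in enumerate(bins.most_common())}

# Invented stream with one highly repeated unit (a run of NOP bytes).
stream = b"\x90\x90\x90\x90" * 8 + b"\xc3\x55\x48\x89"
bins = bin_ngrams(stream)
ids = assign_identifiers(bins)
```

Real identifiers would need to be stable across machines (e.g. derived from content hashes plus a shared frequency table), but the ordering principle is the same.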
Integrate these experimentally determined identifiers with identifiers that developers instrument into their source code.
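One hedged way to perform this integration (all names and structures below are hypothetical): treat the experimentally determined identifier as primary and attach developer-instrumented labels as aliases.

```python
def integrate(observed: dict, instrumented: dict) -> dict:
    """Merge experimentally determined IDs with developer-supplied labels.

    observed:     {code_unit: experimental_id}
    instrumented: {code_unit: developer_label}
    Returns {experimental_id: {"unit": code_unit, "labels": [...]}}.
    """
    merged = {}
    for unit, exp_id in observed.items():
        entry = merged.setdefault(exp_id, {"unit": unit, "labels": []})
        if unit in instrumented:
            entry["labels"].append(instrumented[unit])
    return merged

# Invented data: one unit carries a developer-instrumented label, one does not.
observed = {b"\x90\x90": 0, b"\xc3\x55": 1}
instrumented = {b"\x90\x90": "nop_sled"}
merged = integrate(observed, instrumented)
```

Keeping the experimental ID primary means code groups with no instrumentation at all still get catalogued.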
Determine the functions performed by code groups using, e.g., decompilers (processors that perform speculative execution do something similar when they re-write the byte stream to inject alternative, speculative paths), and assign identifiers to these functions.
The identifiers would span from "small" functions to functions performed by entire executables.
The identifiers would likely be assigned according to a Shannon function, similar to identifiers for code groups.
Observe how the code group functions interact with one another, both within code groups and across code groups;
Observe whether any of the code group functions form feedback loops that result in the reproduction of more computer media (e.g. does data center management and provisioning software interact with chip design software to produce more computer processors and memory that are bought by the data center management and provisioning software for use in data centers?).
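Detecting such a feedback loop amounts to finding a cycle in a directed graph of "output of group A feeds group B" interactions. A sketch, with an invented interaction graph modeled on the data center example above:

```python
def find_cycle(graph):
    """Return one cycle in a directed graph (adjacency-list dict), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:
                # Back edge: the cycle is the stack from nxt onward.
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None

# Hypothetical loop: provisioning buys hardware that chip design and
# fabrication produce, closing a reproductive feedback loop.
graph = {
    "datacenter_provisioning": ["chip_design"],
    "chip_design": ["fabrication"],
    "fabrication": ["hardware_procurement"],
    "hardware_procurement": ["datacenter_provisioning"],
    "monitoring": ["datacenter_provisioning"],
}
cycle = find_cycle(graph)
```

In practice the interaction graph would be enormous and weighted, but the question "does any closed loop end in more computer media being produced?" is still a cycle query.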
Observe whether any code groups are coalescing over time, literally moving closer together in execution space/time, much as distributed amino acid networks may have coalesced into RNA, DNA, and cellular life on early Earth.
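One speculative way to measure "coalescing": per observation window, compute the mean gap between executions of two code groups on a shared timeline; shrinking gaps across windows would suggest the groups are moving closer in execution space/time. The event times below are invented.

```python
from bisect import bisect_left

def mean_gap(events_a, events_b):
    """Mean absolute gap between each event in A and its nearest event in B."""
    b = sorted(events_b)
    gaps = []
    for t in events_a:
        i = bisect_left(b, t)
        # Nearest neighbor is either just before or just after position i.
        candidates = [abs(t - b[j]) for j in (i - 1, i) if 0 <= j < len(b)]
        gaps.append(min(candidates))
    return sum(gaps) / len(gaps)

# Two invented observation windows: group B's executions drift toward group A's.
window1 = mean_gap([0, 100, 200], [50, 150, 250])
window2 = mean_gap([0, 100, 200], [10, 110, 210])
```

The same idea extends to spatial proximity (e.g. same core, same rack, same data center) by swapping timestamps for distance coordinates.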
Distinguish code groups that are created by people from code groups that are created by other code groups.
Develop and test hypotheses regarding a minimum set of code groups/functions that results in more of the code groups being reproduced.
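A minimum-set hypothesis could be tested by greedy ablation: given a predicate that reports whether a candidate set of code groups still reproduces (here a stand-in over an invented set of essential groups), drop one group at a time and keep each drop that does not break reproduction.

```python
def minimal_reproducing_set(groups, reproduces):
    """Greedy ablation: remove groups whose absence does not break reproduction.

    'reproduces' is a predicate over a set of groups. The result is minimal in
    the sense that removing any single remaining group breaks the predicate.
    """
    current = set(groups)
    for g in sorted(groups):  # deterministic order for reproducibility
        trial = current - {g}
        if reproduces(trial):
            current = trial
    return current

# Invented stand-in predicate: reproduction needs exactly these three groups.
ESSENTIAL = {"provisioning", "chip_design", "fabrication"}
found = minimal_reproducing_set(
    {"provisioning", "chip_design", "fabrication", "monitoring", "billing"},
    lambda s: ESSENTIAL <= s,
)
```

Greedy ablation finds *a* minimal set, not necessarily the smallest possible one; with interacting dependencies, different removal orders can land on different minimal sets, which is itself a hypothesis worth testing.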