AI Caveats I: The temptation to generate data yourself

AI methods are transforming and optimizing ever more areas of the economy and society. Deep neural networks in particular are remarkably effective and can solve a variety of tasks that, until recently, only humans could handle: classifying films by content, playing poker at a superhuman level, or generating deceptively realistic faces.
It is all too easy to forget, however, that artificial intelligence differs fundamentally from natural intelligence. While the latter inherently draws on a "worldview", the former ultimately amounts to highly complex interpolation over vast amounts of data. Data is therefore the prerequisite for any AI. The more complex (and usually more powerful) an AI model is, the more data it needs for training.
This becomes especially problematic when not enough data is available: the data may be too expensive, too difficult to obtain or process, or simply nonexistent. For example, cryptocurrency time series rarely provide enough history to reliably train an LSTM model.
One way out of this dilemma is to use embeddings: model components that have already been trained on other data, so that their "prior knowledge" can be reused. This approach can produce astonishing results, as long as that prior knowledge is relevant to the task at hand. A sub-model trained for general pattern recognition, for instance, can serve as a component of a face recognition pipeline. A full model trained to distinguish dogs from cats, on the other hand, will hardly work as a face recognition component.
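The idea of reusing a frozen, pretrained component and training only a small new "head" on top can be sketched as follows. This is a minimal toy illustration, not a real face recognition pipeline: the fixed random projection stands in for a component pretrained on other data, and the dataset and dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained component: in practice these weights would
# come from a network trained on a large, related dataset. Here a fixed
# random projection plays that role.
W_pretrained = rng.normal(size=(2, 8))

def embed(x):
    # Frozen embedding: these weights are NOT updated during training.
    return np.tanh(x @ W_pretrained)

# Small labeled dataset -- too small to train a deep model from scratch.
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Train only a lightweight logistic-regression "head" on the embeddings.
Z = embed(X)
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(Z @ w + b)))   # sigmoid prediction
    grad_w = Z.T @ (p - y) / len(y)      # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

accuracy = np.mean(((Z @ w + b) > 0) == (y > 0.5))
print(f"head accuracy: {accuracy:.2f}")
```

Only the head's few parameters are fitted to the scarce data; the bulk of the representational work is done by the frozen component, which is the whole point of reusing prior knowledge.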
Another tempting way out is to generate your own training (and test) data. Missing cryptocurrency time series, for example, could simply be produced by a simulation and then fed into the neural network. It is crucial to realize, however, that generated data need not reflect reality. In fact, every (simulation) model is only a simplified representation of reality and is always flawed. A Monte Carlo simulation that creates cryptocurrency time series, say, rests on distributional or historical assumptions and will always deliver a distorted image of reality. The same holds for all data generated by simulation, whether on customer behaviour, weather, economic development or click rates.
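To make the point concrete, here is a sketch of such a Monte Carlo generator, using geometric Brownian motion as the stand-in price model. The drift, volatility and starting price are illustrative assumptions, not estimates from real market data; the i.i.d.-normal log-returns are exactly the kind of simplifying assumption that bakes a distorted image of reality into every generated sample.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, n_steps, n_paths, seed=0):
    """Generate price paths under geometric Brownian motion.

    The normality and independence of log-returns are modeling
    assumptions; real cryptocurrency returns show fat tails and
    volatility clustering that this generator cannot produce.
    """
    rng = np.random.default_rng(seed)
    dt = 1 / 365  # daily steps
    log_returns = rng.normal((mu - 0.5 * sigma**2) * dt,
                             sigma * np.sqrt(dt),
                             size=(n_paths, n_steps))
    return s0 * np.exp(np.cumsum(log_returns, axis=1))

# Illustrative parameters (not calibrated to any real asset).
paths = simulate_gbm(s0=30_000.0, mu=0.05, sigma=0.8,
                     n_steps=365, n_paths=1000)
print(paths.shape)
```

A network trained on these paths learns the statistics of this generator, not of any real market: every regularity it picks up was put there by the choice of `mu`, `sigma` and the normal distribution.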
One must therefore always bear in mind that a neural network trained on generated data "learns" the underlying image of reality, but by no means reality itself. Such a model may excel in the simulated world (achieving phenomenal share gains, say) and still fail in the real one.
When generating data, one should therefore always ask: What is the impact on the results? Are the results usable in reality? And finally: is an AI method even required here, or can the results also be achieved with simple rule-based procedures?
