Application of Deep Learning in Art Therapy

Abstract: Art therapy is a non-verbal psychotherapy that diagnoses and treats human psychology through the medium of art. It builds on the observation that human psychology, especially the unconscious, appears directly through non-verbal forms rather than through specific language. It is used in various fields such as psychotherapy and rehabilitation, and mainly for the psychotherapy of children who have difficulty expressing their feelings in words. Art therapists interpret the symbolic meanings shown in drawings to diagnose the psychological state of the counselee, and record them as text. During this process, however, interpretation and diagnosis may be affected by the therapist's subjectivity and experience. It is therefore necessary to improve the reliability and objectivity of therapy by automating part of the process. For this purpose, we propose a CNN (Convolutional Neural Network)-based deep learning method for art therapy. Image classification and caption generation using deep learning models have been actively studied in computer vision and natural language processing. In particular, state-of-the-art results have been achieved by applying CNN-based image models and transfer learning with models pre-trained on large amounts of data. In this paper, we present a CNN model that finds symbolic features in drawings that can be used as clues in the process of art therapy. Specifically, we apply the image captioning and attention techniques of deep learning to identify psychological features in each drawing. After the key features in drawings have been identified and summarized through the proposed methodology, a psychotherapist can make consistent and standardized interpretations based on them more efficiently. We expect that the proposed methodology may contribute to increasing the reliability and objectivity of art therapy.


Manuscript received May 12, 2020; revised January 5, 2021. The authors are with the Graduate School of Business IT of Kookmin University, Seoul, Korea (e-mail: {jeung722, yunyi94, lkh5021, kykwahk, ngkim}@kookmin.ac.kr).

I. INTRODUCTION
Human psychology, especially the unconscious, is known to be difficult to formulate and appears directly through non-verbal forms rather than through specific language. Based on this characteristic, research on art therapy, a non-verbal psychological therapy that diagnoses and treats human psychology through the medium of art, is being actively conducted. Art therapy is used in various fields such as psychotherapy, education, and rehabilitation, and mainly for the psychotherapy of children who have difficulty expressing their feelings in words. The most representative method of art therapy with drawings is the HTP (House-Tree-Person) test [1]. This test tries to understand psychological aspects such as cognition, emotion, and human relationships through three symbolic objects. Children's unconscious is expressed through drawings, and art therapists interpret the symbolic meanings shown in the drawings to diagnose the psychological state of the counselee and record them as text.
During this process, however, interpretation and diagnosis may be affected by the therapist's subjectivity and experience. This implies that the interpretation of the same drawing may differ from one therapist to another, which could reduce public trust in art therapy. It is therefore necessary to improve the reliability and objectivity of therapy by automating part of the process with AI. Although some studies have attempted to interpret the results of psychological tests automatically, studies using deep learning algorithms for this purpose are still uncommon in the psychological domain.
To make consistent diagnoses in art therapy, there is a need for an art therapy support system built on a deep learning-based image captioning model that can automatically generate interpretations for given drawings. However, captioning for art therapy has a quite different purpose from traditional image captioning. Whereas traditional image captioning focuses on describing the factual content of images, captioning in art therapy must figure out the symbolic meaning of a drawing. Unlike traditional image captioning models, therefore, an image interpretation model first needs to find the key features that therapists regard as significant for diagnosis, such as the components of a house, the number of windows in the house, and the shape of the roots of a tree.
Image classification and caption generation using deep learning models have been actively studied in the fields of computer vision and NLP (Natural Language Processing). In particular, state-of-the-art results have been achieved by applying CNN (Convolutional Neural Network)-based image models [2]-[5] and transfer learning with models pre-trained on large amounts of data. However, traditional deep learning models usually focus not on detecting the key factors that have important meanings in art therapy but on the general context of images. Therefore, a different approach is required to apply image deep learning models to capturing and interpreting psychological images.
In this paper, we introduce the concept of a deep learning-based image captioning model for an art therapy support system. Specifically, we present a CNN model that finds symbolic features that have important meanings in art therapy. We also report the results of our preliminary experiments, in which we used drawing images acquired from the HTP test. We performed two separate experiments, one for image classification and one for symbolic feature detection, and summarize the results.

International Journal of Machine Learning and Computing, Vol. 11, No. 6, November 2021

II. RELATED WORKS
In this section, we briefly review recent works on image captioning and the HTP test.

A. Object Recognition
The main concept of our model is based on the CNN, which was introduced by LeCun in 1989 [2]. That study yielded meaningful results in handwriting recognition. Many attempts to develop CNNs have followed since 1998, when LeCun proposed the LeNet architecture [6]. The CNN is considered one of the mainstream deep learning models and is especially useful for image recognition, because it can find patterns for recognizing objects by learning directly from image data.
Many research areas such as computer vision, object classification, and object recognition have advanced through the use of CNNs. Object recognition using a CNN usually takes two steps: first, finding the regions where objects are located, and second, distinguishing the types of the objects found. RCNN (Region-based Convolutional Neural Network) [7] is a representative and powerful example of this two-stage method. However, since RCNN is computationally slow, further studies have been conducted to solve this problem, including Fast RCNN and Faster RCNN [8], [9].

B. Image Captioning
Based on this development, research has been conducted not only to recognize objects in images but also to describe them. To understand the relationships among features and express them in natural language, image captioning requires a language model as well as object recognition. To this end, many studies have used the RNN (Recurrent Neural Network), which is considered another axis of deep learning models. The CNN+RNN structure of Show and Tell [10] improved image captioning performance. In that work, the image encoding produced by a CNN is decoded by an RNN to generate a sentence describing the image. As a result, the study produced a single joint model that is trained to maximize the likelihood of the target sequence of words given the input image.
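The likelihood objective of Show and Tell [10] mentioned above can be written out as follows. The symbols are assumptions introduced here for illustration and are not defined elsewhere in this paper: I is the input image, S = (S_1, ..., S_N) is the target word sequence, and \theta denotes the model parameters. The second expression is simply the chain rule applied to the caption words, which is what lets an RNN decoder generate the sentence one word at a time.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p\left(S_t \mid I, S_1, \ldots, S_{t-1}; \theta\right)
```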
Studies are being conducted from various angles to improve the performance of image captioning. First, the attention mechanism, which has recently been in the spotlight, has been used to analyze which part of a given image the corresponding sentence focuses on [11]. With an attention mechanism, we can obtain not only a more accurate description of the image but also more precise information about what the caption refers to.
There have also been interesting studies on different styles of description, which vary the style of the caption in contrast to existing fact-oriented captioning. Gan et al. (2017) proposed a method that describes images in a romantic style and a humorous style [12], and Mathews et al. (2018) proposed a method that creates story-like descriptions using semantic terms [13].
In addition, dense captioning has been investigated [14]. Dense captioning deals with captioning from a detailed perspective rather than an overall perspective. In this method, each input image is divided into regions and a detailed description is created for each, because the various aspects of an image cannot be fully described from the viewpoint of the whole. In this way, a great deal of research has been done on detecting the details of an image and generating descriptions in various styles.

C. Transfer Learning
One of the most important factors affecting the accuracy of a machine learning model is the quantity and quality of the training data. If there are not enough data, the performance of the model cannot be guaranteed. Among the many approaches [15] to this problem, transfer learning has received a lot of attention recently. By using existing models to create new models, transfer learning can improve predictive power, reduce training time, and reduce the human resources required. Transfer learning has given rise to various follow-up studies and has become a major trend in computer vision research [16].
Domain adaptation is one of the most important issues in transfer learning research. If we do not have a sufficient amount of high-quality training data, we can instead build a new model from an existing model trained on another domain with sufficient data [17]. The main concept of domain adaptation is to use already acquired knowledge to learn new situations. The concept has also been applied to the GAN (Generative Adversarial Network), where it has achieved large performance improvements [18].

D. HTP Test
The HTP test was first introduced by Buck (1948) [1]. It analyzes the personality, perception, and emotions of an individual through drawings of a house, a tree, and a person, and it has continued to be used as a major method of art therapy. However, it has the limitation that interpretation and diagnosis may be affected by the therapist's subjectivity and experience, so that the interpretation of the same drawing may differ from one therapist to another.
To make consistent diagnoses in art therapy, a few studies have recently emerged that automate parts of the HTP test process.
To assist the interpretation of the HTP test, an experiment was conducted to classify and label each object [19]. It showed that objects can be recognized successfully when simple sketch images are used. Kim et al. (2005) developed a program that automatically checks the key features of HTP test images [20]. Based on the HTP test data, the features needed for further psychological analysis were digitized. After some post-processing, they can be used as supporting indicators for analyzing psychological symptoms.

III. PROPOSED METHODOLOGY
In this section, we propose a new psychological feature detection model for an art therapy support system. Fig. 1 describes the overall process of our methodology.
Our research model has three sub-processes: ⅰ) Object Classification, ⅱ) Psychological Feature Detection, and ⅲ) Caption Generation. In contrast to traditional research on image captioning, we include a psychological feature detection step to find meaningful features in HTP drawing images and interpret those features with psychological knowledge. We explain the three sub-processes in detail below.
The first process of our model is Object Classification. To improve the accuracy of object classification, we use not only HTP drawing images but also a model pre-trained on a general image set. In this process, each object is labeled as 'house' or 'tree'. After the objects are labeled, the pre-trained model with the ResNet algorithm is applied to classify new images into the predefined labels. ResNet can map inputs to outputs more efficiently by learning residual values and using skip connections. Fig. 2 gives a simple illustration of the concept of residual learning. Finally, by freezing the layers of the ResNet50 model to avoid sensitive changes and adding a CNN layer, object-focused image vectors are returned as the outputs of this process.
The second step of our methodology is Psychological Feature Detection. In this step, we detect each psychological object and its features in HTP drawing images. For example, we can detect three windows and one big wooden door in a drawing of a house. We need this step because art therapists use the features of each component as well as the whole image for diagnosis. The features we detect are regarded as having symbolic meanings that represent the psychological status of the counselee.
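As a reference point for the residual learning and skip connections mentioned above for ResNet, the following is a minimal sketch in plain Python. It is not the ResNet50 implementation used in the experiments: the fully connected transform, the helper names (`linear`, `relu`, `residual_block`), and the toy weights are all assumptions made for illustration. The point it shows is that the block computes F(x) + x, so the layers only need to learn the residual F(x).

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; a stand-in for a convolutional layer."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Compute relu(F(x) + x), where F is linear -> ReLU -> linear."""
    fx = linear(w2, relu(linear(w1, x)))    # residual branch F(x)
    summed = [f + xi for f, xi in zip(fx, x)]  # skip connection adds the input back
    return relu(summed)


# Hypothetical toy input; with all-zero weights F(x) is zero,
# so the block reduces to relu(x).
zero = [[0.0] * 3 for _ in range(3)]
out = residual_block([1.0, -2.0, 3.0], zero, zero)
```

With zero weights the residual branch contributes nothing and the input passes through the skip connection unchanged (apart from the final ReLU), which is one common way to explain why residual blocks are easy to optimize.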
If a drawing image has three windows in its house, we label the picture with '3 windows'. After assigning multiple feature labels to each drawing image, we input all of these images into a basic CNN model so that the model can classify the drawing images by the given features. This model is expected to return feature-focused image vectors as its output. Each image is represented as a corresponding vector in which the pre-detected features are highlighted. If a drawing image has '3 windows' as a label, the vectors of the windows are expected to be emphasized more than other portions of the image. In this manner, it is possible to find other features that have specific meaning for the counselee's psychological status. Together, the above two steps capture the objects in each image and their features. By concatenating the two output vectors, we can obtain HTP image vectors that carry information about psychological features.
The final process of our methodology is Caption Generation for HTP images. In this process, we use an attention mechanism to find the important parts of the whole drawing image that contain psychologically meaningful features. The attention mechanism is an extension of the RNN-based seq2seq model. With an attention model, it is possible to identify the significant portions of input images. Since the above two steps generate image vectors with highlighted psychological features, the attention mechanism can then focus on the highlighted areas during caption generation.
To apply the attention model to our caption generation, a pre-trained image captioning model needs to be used. With a pre-trained captioning model, the result of training is represented as attention weights of related terms for each image. These weights can be used to generate captions of HTP drawing images by filtering and selecting only the highly weighted terms among all the parsed sample captions in the pre-trained model. Our proposed methodology finishes by generating captions of HTP drawing images that are helpful for interpreting psychological status.
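The concatenation of the two output vectors described above can be sketched as follows. The function name `concat_image_vectors` and the toy values are hypothetical; in the proposed methodology the two inputs would come from the object classification and psychological feature detection sub-models, whose vector sizes are not specified in this paper.

```python
from typing import List


def concat_image_vectors(object_vec: List[float],
                         feature_vec: List[float]) -> List[float]:
    """Join an object-focused and a feature-focused vector into one HTP image vector."""
    return list(object_vec) + list(feature_vec)


# Hypothetical toy vectors, not real model outputs.
object_vec = [0.9, 0.1]         # e.g. leaning toward 'house' rather than 'tree'
feature_vec = [0.2, 0.7, 0.1]   # e.g. emphasis on window-related features
htp_vec = concat_image_vectors(object_vec, feature_vec)
```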
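To make the idea of focusing on significant portions of an image concrete, here is a minimal sketch of soft attention over image regions. It is an illustration under stated assumptions, not the model used in this paper: the dot-product scoring function, the names `softmax` and `attend`, and the toy region vectors are all hypothetical. Each region receives a weight, the weights sum to one, and the context vector passed to the caption decoder is the weighted sum of the region vectors.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, regions: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoding step."""
    # Assumed scoring function: dot product between decoder state and region.
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # Context vector: attention-weighted sum of the region vectors.
    context = [sum(w * region[i] for w, region in zip(weights, regions))
               for i in range(dim)]
    return weights, context


# Hypothetical example: the third region is strongly 'highlighted'.
regions = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
weights, context = attend([1.0, 0.0], regions)
```

In this toy case the third region gets the largest weight, which mirrors the intent that regions highlighted by the earlier sub-processes should dominate the caption generation step.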
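The filtering step described above, which keeps only the highly weighted terms, might look like the sketch below. The threshold, the `top_k` limit, the function name `select_terms`, and the example weights are all assumptions for illustration; the paper does not specify how the cut-off is chosen.

```python
from typing import Dict, List


def select_terms(term_weights: Dict[str, float],
                 threshold: float = 0.2,
                 top_k: int = 3) -> List[str]:
    """Keep terms whose attention weight meets the threshold, highest first."""
    kept = [(term, w) for term, w in term_weights.items() if w >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in kept[:top_k]]


# Hypothetical attention weights for one house drawing (made-up values).
example_weights = {"window": 0.5, "door": 0.3, "sky": 0.05, "roof": 0.15}
selected = select_terms(example_weights)
```

The selected terms could then serve as candidate keywords for a therapist-facing caption, with the threshold tuned on real data.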

A. Image Captioning with Attention
In this preliminary experiment, we check whether an existing captioning model works well even with sketch image data. For this experiment, we used the MS COCO 2014 image captioning data set, which contains 80,000 training samples and 40,000 validation samples. We used the model proposed by Vinyals et al. (2015) [10]. It is based on a seq2seq model with visual attention and is widely used for image caption generation. The model consists of a CNN encoder for learning image features and an LSTM (Long Short-Term Memory) decoder for generating captions. Fig. 3 shows the result of applying this model to HTP data. The model neither generates meaningful captions nor recognizes the core objects.
Fig. 3 shows sketch images of a house, a tree, a cat, and a bull, in that order. The captions generated automatically for these images are shown in Table I. When the captioning model with attention [10] was simply applied, the sketch objects could not be recognized at all and the generated captions were difficult to understand. This implies that, in order to generate a proper caption for a sketch image, the model must be able to learn specific features of sketch images. For this purpose, we adopted two sub-models and investigated the results of applying them in the following two experiments.
First, we tested a model that classifies houses and trees to check whether a traditional model is suitable for distinguishing objects in sketch images. We crawled house and tree sketch data from Google Images. A total of 400 images were collected, and a deep learning-based classification model was trained on them. We created a simple CNN classification model. Fig. 4 shows the parameters of the model. We used the default hyperparameters provided by Keras.
The result of training is shown in Fig. 5. The graph in Fig. 5 reveals that the model was underfitted due to the lack of training data. To compensate for the small amount of data, we used transfer learning with a ResNet50 model pre-trained on a large amount of image data.
We then fine-tuned the ResNet50 model. Fig. 6 shows the parameters of the entire ResNet layer and the shallow layers stacked on it. The upper, partly omitted part of Fig. 6 is the ResNet, and the lower part is the newly stacked shallow layers for fine-tuning. (Since the parameter listing of the ResNet is too long, part of it is omitted.) We again used the default hyperparameters provided by Keras. The result of this fine-tuned model is shown in Fig. 7. At epoch 4, the validation accuracy is about 69.30%. Therefore, the accuracy of classifying sketch images into trees and houses is about 70%.

C. Psychological Feature Classification for Sketch Image (Number of Windows)
The above model indicates that objects in simple sketch images can be classified properly. We therefore performed the next experiment with the psychological feature classification model. This is a sub-model that, after objects are classified, learns and identifies symbolic features, that is, characteristics of the objects. For example, in art therapy, a window of the house symbolizes the counselee's passage to the outside world. To learn this symbol, we attempted to identify the number of windows, which is a symbolic property of the house.
We crawled house sketch data from Google Images and collected about 500 sketch images of houses. We then divided the collected data into houses with four or fewer windows and houses with five or more.
As in the former experiment, we expected that transfer learning would give better results than the simple CNN model. However, the model using transfer learning showed lower performance and failed to learn at all.
Instead of transfer learning, we performed end-to-end learning using a CNN-based classification model. Fig. 8 shows the parameters of this model, and Fig. 9 shows its training result. Although the amount of data used for training was very small, the validation accuracy reaches about 85%.
Pre-trained models such as ResNet50 are more appropriate for classifying different objects. In art therapy, however, it is more important to learn the symbolism in specific properties of objects than to solve simple classification problems. This implies that transfer learning for psychological feature classification requires more careful treatment.
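The two-way split described above can be expressed as a small labeling rule. The function name `window_class` and the label strings are hypothetical; only the cut-off between four and five windows comes from the text.

```python
def window_class(num_windows: int) -> str:
    """Map a window count to one of the two classes used in this experiment."""
    if num_windows < 0:
        raise ValueError("number of windows cannot be negative")
    # Cut-off from the text: four or fewer versus five or more.
    return "four_or_fewer" if num_windows <= 4 else "five_or_more"


# Hypothetical window counts for a handful of house sketches.
counts = [0, 3, 4, 5, 8]
labels = [window_class(n) for n in counts]
```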

V. CONCLUSION
In this paper, we proposed the concept of an AI art therapy support system based on a deep learning model. By applying a deep learning model, we could bring efficiency and objectivity to the art therapy process. Our model identifies psychological features of objects that art therapists might be interested in. We expect that this model can help art therapists make consistent and objective diagnoses.
In our first experiment, we confirmed that there are limitations in applying general captioning models to art therapy. In the process of caption generation, it was difficult to recognize psychological features directly from sketch image data. Therefore, prior to captioning, we introduced additional sub-models that first recognize general objects in sketch image data and then detect psychological features.
In the above process, we presented two sub-models for art therapy. The first model uses deep learning to recognize and classify drawings in simple sketches. We confirmed that a deep learning model can recognize and classify even simple sketch images drawn by children. Second, beyond classifying simple objects, we built a model to learn and classify symbolic features of objects in sketch data. In art therapy, the model needs to be able to grasp symbolism, not just learn simple objects. We expect that the proposed model for learning psychological features of objects will contribute to learning and discovering psychologically meaningful factors.
However, this research is at the stage of proposing the overall framework for the research goal, and concrete implementation of and experiments on the whole process have not yet been carried out. In future research, it is necessary to design the detailed modules of the whole process and to verify the performance of the proposed system through implementation and intensive experiments.