With the rise of social platforms, a growing amount of content is generated by users on the internet, producing large volumes of text such as news articles, Weibo posts, and blogs. Given this huge and emotionally rich body of text, it is natural to explore its latent value. In recent years, sentiment analysis has therefore attracted close attention from researchers in computational linguistics and has become a hot research topic.
The goal of this task is to accurately determine the sentiment polarity of text in a large dataset. Sentiment is divided into three classes: positive, negative, and neutral. Given the vast amount of news information, accurately identifying the sentiment tendencies hidden in it is of great significance for effectively monitoring, warning about, and guiding public opinion, and for the healthy development of the public-opinion ecosystem.
Task
Participants must classify the sentiment polarity of the news data we provide: positive sentiment maps to 0, neutral to 1, and negative to 2. Using the provided training data, your algorithm or model should predict the sentiment polarity of each news item in the test set.
This competition provides three datasets: the training set train.txt, the evaluation set evaluate.txt, and the test set test.txt. The data format is as follows:
Field      Type      Description                         Note
news_id    String    News ID
title      String    Title content
content    String    Body text of the news item
label      String    Sentiment label of the news item
BaseLine
#! -*- coding:utf-8 -*-
import codecs
import gc
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, recall_score, precision_score
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.callbacks import *
from keras.models import Model
import keras.backend as K
from keras.optimizers import Adam
from keras.utils import to_categorical
train_lines = codecs.open('Train_DataSet.csv').readlines()[1:]
train_df = pd.DataFrame({
    'id': [x[:32] for x in train_lines],
    'ocr': [x[33:].strip() for x in train_lines]
})
train_label = pd.read_csv('Train_DataSet_Label.csv')
train_df = pd.merge(train_df, train_label, on='id')

test_lines = codecs.open('Test_DataSet.csv').readlines()[1:]
test_df = pd.DataFrame({
    'id': [x[:32] for x in test_lines],
    'ocr': [x[33:].strip() for x in test_lines]
})
A note on input length: BERT's maximum sequence length is 512, and here the maximum length is set to 500. By default one would keep the first N characters of a long text, but the first N characters are not necessarily the most informative. Empirically, keeping both the head and the tail of the text performs significantly better than keeping the head or the tail alone.
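The head-plus-tail idea can be sketched as follows. This is a minimal illustration; the 500-character budget and the even 250/250 head/tail split are assumptions, not values fixed by the text above.

```python
# A minimal sketch of head-plus-tail truncation; the 250/250 split
# is an assumed choice, not specified in the writeup.
def head_tail_truncate(text, max_len=500, head_len=250):
    """Keep the first `head_len` and the last `max_len - head_len`
    characters when `text` exceeds `max_len`; otherwise return it unchanged."""
    if len(text) <= max_len:
        return text
    tail_len = max_len - head_len
    return text[:head_len] + text[-tail_len:]
```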
token_dict = {}
# dict_path points to the vocab.txt file shipped with the pretrained BERT model
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            elif self._is_space(c):
                R.append('[unused1]')  # map whitespace to a reserved token
            else:
                R.append('[UNK]')
        return R
tokenizer = OurTokenizer(token_dict)
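The subclass above tokenizes character by character: in-vocabulary characters pass through, whitespace becomes [unused1], and everything else becomes [UNK]. The same per-character logic can be illustrated standalone (the tiny vocabulary below is a made-up example, not BERT's real vocab):

```python
# Standalone illustration of OurTokenizer's per-character logic;
# the four-character vocabulary here is invented for the demo.
def char_tokenize(text, vocab):
    tokens = []
    for c in text:
        if c in vocab:
            tokens.append(c)
        elif c.isspace():
            tokens.append('[unused1]')
        else:
            tokens.append('[UNK]')
    return tokens

vocab = {'新', '闻', 'a', 'b'}
print(char_tokenize('新闻 ab!', vocab))
# ['新', '闻', '[unused1]', 'a', 'b', '[UNK]']
```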
def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x
        for x in X
    ])
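A quick usage example of right-padding a batch to the longest sequence (the function is repeated here only so the example is self-contained):

```python
import numpy as np

# Same seq_padding as in the baseline, repeated for self-containment.
def seq_padding(X, padding=0):
    ML = max(len(x) for x in X)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x
        for x in X
    ])

batch = [[1, 2, 3], [4, 5]]
print(seq_padding(batch))
# [[1 2 3]
#  [4 5 0]]
```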
# acc_top2 is used below but not defined in the excerpt;
# a standard top-2 accuracy metric is assumed here:
from keras.metrics import top_k_categorical_accuracy

def acc_top2(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=2)

def build_bert(nclass):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)
    for l in bert_model.layers:
        l.trainable = True

    x1_in = Input(shape=(None,))  # token ids
    x2_in = Input(shape=(None,))  # segment ids

    x = bert_model([x1_in, x2_in])
    x = Lambda(lambda x: x[:, 0])(x)  # take the [CLS] vector
    p = Dense(nclass, activation='softmax')(x)

    model = Model([x1_in, x2_in], p)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(1e-5),
                  metrics=['accuracy', acc_top2])
    print(model.summary())
    return model
Here TP is the number of true positives, FP the false positives, and FN the false negatives. Precision and recall are Precision = TP / (TP + FP) and Recall = TP / (TP + FN), and the per-class score is F1 = 2 * Precision * Recall / (Precision + Recall). The Macro-F1 value is obtained by averaging the per-class F1 values.
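Macro-F1 can be computed directly with the f1_score import already in the baseline. The sample labels below are invented purely to demonstrate the call:

```python
# Macro-F1 demo; y_true / y_pred are made-up labels for illustration.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2]

# average='macro' computes F1 per class, then takes the unweighted mean
macro_f1 = f1_score(y_true, y_pred, average='macro')
```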
train_pred = [np.argmax(x) for x in train_model_pred]
train_df['label'] = train_pred
train_df[['id', 'label']].to_csv('Train_pred_baseline.csv', index=None)

test_model_pred = test_model_pred / nfold
test_prob = test_model_pred.tolist()
test_pred = [np.argmax(x) for x in test_model_pred]
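The division by nfold averages the probability matrices accumulated across the StratifiedKFold models before taking the argmax. A toy illustration (the two per-fold outputs below are invented numbers):

```python
import numpy as np

# Toy ensemble of two folds' softmax outputs; the numbers are invented.
fold1 = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
fold2 = np.array([[0.2, 0.7, 0.1],
                  [0.1, 0.3, 0.6]])

nfold = 2
summed = fold1 + fold2        # accumulated across folds during CV
avg = summed / nfold          # averaged ensemble probabilities
pred = [int(np.argmax(x)) for x in avg]
print(pred)
# [1, 2]
```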