Title | Big Data | Subtitle | 7th CCF Conference, | Editors | Hai Jin, Xuemin Lin, Yihua Huang | Series | Communications in Computer and Information Science | Description | This book constitutes the proceedings of the 7th CCF Conference on Big Data, BigData 2019, held in Wuhan, China, in October 2019. The 30 full papers presented in this volume were carefully reviewed and selected from 324 submissions. They are organized in topical sections as follows: big data modelling and methodology; big data support and architecture; big data processing; big data analysis; and big data application. | Type | Conference proceedings 2019 |
1 |
Front Matter |
|
|
Abstract
|
3 |
A Constrained Self-adaptive Sparse Combination Representation Method for Abnormal Event Detection |
Huiyu Mu, Ruizhi Sun, Li Li, Saihua Cai, Qianqian Zhang |
|
Abstract
Automated abnormal event detection systems meet society's need to detect and locate anomalies and alert operators. In this paper, we propose a constrained self-adaptive sparse combination representation (CSCR). Low-level features from spatio-temporal video volumes, stacked in a multi-scale pyramid, allow features to be extracted effectively. The CSCR strategy is robust in learning the dictionary and detecting abnormal behaviors. Experiments on the published dataset and comparisons with existing methods demonstrate the advantages of our method.
|
4 |
A Distributed Scheduling Framework of Service Based ETL Process |
DongJu Yang, ChenYang Xu |
|
Abstract
Service-oriented computing and ETL (Extract-Transform-Load) technology have recently received significant attention as enablers of data warehouse construction and data integration. To improve the scheduling and execution efficiency of service-based ETL processes, this paper proposes a distributed scheduling and execution framework for ETL processes and a corresponding method. First, different weights are assigned to ETL processes to ensure the loading efficiency of core business data. Second, the scheduler selects executors according to their performance and load, then allocates ETL process execution requests using the greedy balance (GB) algorithm to balance the load across executors. Third, the executor parses each ETL process into ETL services, then selects one or more executors to deploy and execute each ETL service according to a locality-aware strategy, that is, according to the amount of data involved and the network distance between the nodes the service involves, which reduces network overhead and improves execution efficiency. Finally, the effectiveness of the proposed method is verified by experimental comparison.
|
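The abstract above names a greedy balance (GB) algorithm without giving its details. As a rough illustration only, the sketch below assumes the common greedy heuristic: take requests in descending order of estimated cost and give each one to the executor with the lowest current load. The function name, the cost units, and the executor names are hypothetical and are not taken from the paper.

```python
import heapq


def greedy_balance(request_costs, executors):
    """Assign each request to the currently least-loaded executor.

    request_costs: dict of request id -> estimated cost (hypothetical units).
    executors: list of executor names (hypothetical).
    Returns (assignment, loads).
    """
    # Min-heap of (current load, executor name); ties break by name.
    heap = [(0.0, name) for name in executors]
    heapq.heapify(heap)
    assignment = {}
    # Place the most expensive requests first so small ones fill the gaps.
    ordered = sorted(request_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for request, cost in ordered:
        load, name = heapq.heappop(heap)
        assignment[request] = name
        heapq.heappush(heap, (load + cost, name))
    loads = {name: load for load, name in heap}
    return assignment, loads


# Assumed example workload: four ETL process requests, two executors.
requests = {"orders": 8.0, "customers": 5.0, "logs": 4.0, "clicks": 3.0}
assignment, loads = greedy_balance(requests, ["exec-a", "exec-b"])
```

In this sketch the process weights mentioned in the abstract could be folded into the cost estimate or the sort key; how the paper combines them with executor performance is not stated in the abstract.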
5 |
A Probabilistic Soft Logic Reasoning Model with Automatic Rule Learning |
Jia Zhang, Hui Zhang, Bo Li, Chunming Yang, Xujian Zhao |
|
Abstract
Probabilistic Soft Logic (PSL), a declarative rule-based probability model, has strong extensibility and multi-domain adaptability and has been applied in many domains. In practice, a main difficulty is that a great deal of common sense and domain knowledge must be set manually as preconditions for rule establishment, and acquiring this knowledge is often very expensive. To alleviate this dilemma, this paper works on two aspects. First, an automatic rule learning method is proposed, which combines the AMIE+ algorithm and PSL to form a new reasoning model. Second, a multi-level method is used to improve the reasoning efficiency of the model. The experimental results show that the proposed methods are feasible.
|
6 |
Inferring How Novice Students Learn to Code: Integrating Automated Program Repair with Cognitive Mod |
Yu Liang, Wenjun Wu, Lisha Wu, Meng Wang |
|
Abstract
Learning to code through Massive Open Online Courses (MOOCs) has become increasingly popular among novice students, yet inferring how students learn programming on MOOCs remains a challenging task. To address this challenge, we build a novel Intelligent Programming Tutor (IPT) that integrates Automated Program Repair (APR) with a student cognitive model. We improve an efficient APR engine so that it not only produces repair results but also identifies the types of programming errors. Based on APR, we extend the Conjunctive Factor Model (CFM) by using programming error classification as the cognitive skill representation to support the student cognitive model of learning to program. We validate our IPT on a real dataset collected from a Python programming course. The results show that, compared with the original CFM, our model represents students' programming learning outcomes and predicts their future performance more reliably. We also compare our student cognitive model with the state-of-the-art Deep Knowledge Tracing (DKT) model. Our model requires less training data and is more interpretable than the DKT model.
|
7 |
Predicting Friendship Using a Unified Probability Model |
Zhijuan Kou, Hua Wang, Pingpeng Yuan, Hai Jin, Xia Xie |
|
Abstract
It is now popular for people to share their feelings and activities, tagged with geographic and temporal information, in Online Social Networks (OSNs). The spatial and temporal interactions that occur in OSNs contain a wealth of information indicating friendship between persons. Existing research has generally focused on a single dimension, either spatial or temporal, and such simplified models work only in limited scenarios. Here, we aim to understand how the probability of friendship relates to the place and time of interactions. First, the spatial similarity of interactions is defined as a vector of places where persons checked in. Second, we employ exponential functions to characterize how the strength of interactions changes over time. Finally, a unified probability model to predict friendship between two persons is given. The model contains two sub-models, based on spatial similarity and temporal similarity respectively. Experimental results on four data sets, including spatial data sets (Gowalla and Weeplaces) and temporal data sets (Higgs Twitter Data set, High school Call Data set), show that our model works as expected.
|
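The two ingredients named in the abstract above, a check-in place vector and exponential decay of interaction strength, can be sketched as follows. This is an illustrative sketch, not the paper's model: cosine similarity as the spatial measure, the `decay` rate, the squashing of temporal strength into [0, 1), and the `alpha` mixing weight are all assumptions introduced here.

```python
import math
from collections import Counter


def spatial_similarity(checkins_a, checkins_b):
    """Cosine similarity between two users' check-in count vectors."""
    va, vb = Counter(checkins_a), Counter(checkins_b)
    dot = sum(va[p] * vb[p] for p in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def temporal_strength(interaction_ages, decay=0.1):
    """Sum of exponentially decayed interactions (ages in days before now)."""
    return sum(math.exp(-decay * age) for age in interaction_ages)


def friendship_score(checkins_a, checkins_b, interaction_ages,
                     alpha=0.5, decay=0.1):
    """Hypothetical combination of the two sub-scores into one value in [0, 1]."""
    # Squash the unbounded temporal strength into [0, 1) before mixing.
    temporal = 1.0 - math.exp(-temporal_strength(interaction_ages, decay))
    spatial = spatial_similarity(checkins_a, checkins_b)
    return alpha * spatial + (1.0 - alpha) * temporal


# Assumed example: two shared places, interactions 1 and 30 days ago.
score = friendship_score(["cafe", "gym", "cafe"], ["cafe", "gym"], [1, 30])
```

The split into a spatial and a temporal function mirrors the two sub-models the abstract mentions; how the paper actually fuses them into a probability is not given there.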
8 |
Product Feature Extraction via Topic Model and Synonym Recognition Approach |
Jun Feng, Wen Yang, Cheng Gong, Xiaodong Li, Rongrong Bo |
|
Abstract
As e-commerce becomes more and more popular, sentiment analysis of online reviews has become one of the most important research topics in text mining. The main task of sentiment analysis is to analyze the user’s attitude towards different product features. Product feature extraction refers to extracting from reviews the product features that users evaluate, and it is the first step towards further sentiment analysis tasks. Existing product feature extraction methods do not address the flexibility and randomness of online reviews. Moreover, these methods suffer from defects such as low accuracy and recall. In this study, we propose a product feature extraction method based on a topic model and synonym recognition. First, we set a threshold that the TF-IDF value of a product feature noun must reach in order to filter out meaningless words in reviews, and we select the threshold by grid search. Second, considering the occurrence patterns of different product features in reviews, we propose a novel product similarity calculation, which also performs weighted fusion based on information entropy with a variety of general similarity calculation methods. Finally, compared with traditional methods, the experimental r
|
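The first step described in the abstract above, filtering candidate nouns by a TF-IDF threshold chosen through grid search, might look roughly like the sketch below. It assumes reviews are already tokenized into non-empty lists of candidate nouns, uses the plain `tf * log(N / df)` weighting, and scores each grid value by F1 against a small labeled set; the helper names, the toy reviews, and the gold set are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return the highest TF-IDF value each term reaches across the docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n / df[term])
            best[term] = max(best.get(term, 0.0), score)
    return best


def filter_candidates(docs, threshold):
    """Keep terms whose best TF-IDF value reaches the threshold."""
    return {t for t, s in tfidf_scores(docs).items() if s >= threshold}


def grid_search_threshold(docs, gold, grid):
    """Pick the grid value whose kept terms best match the gold set (F1)."""
    def f1(predicted):
        tp = len(predicted & gold)
        if tp == 0:
            return 0.0
        precision, recall = tp / len(predicted), tp / len(gold)
        return 2 * precision * recall / (precision + recall)
    return max(grid, key=lambda t: f1(filter_candidates(docs, t)))


# Assumed toy reviews: "phone" and "thing" occur everywhere, so their IDF is 0.
reviews = [
    ["phone", "battery", "battery", "thing"],
    ["phone", "screen", "thing"],
    ["phone", "camera", "thing"],
]
gold_features = {"battery", "screen", "camera"}
best_threshold = grid_search_threshold(reviews, gold_features, [0.0, 0.2, 0.5])
candidates = filter_candidates(reviews, best_threshold)
```

Terms that appear in every review get an IDF of zero and fall below any positive threshold, which is the filtering effect the abstract describes.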
9 |
Vietnamese Noun Phrase Chunking Based on BiLSTM-CRF Model and Constraint Rules |
Hua Lai, Chen Zhao, Zhengtao Yu, Shengxiang Gao, Yu Xu |
|
Abstract
In natural language processing, using chunk analysis instead of full parsing can greatly reduce the complexity of parsing. Noun phrase chunks, as one type of chunk, occur in a large number of sentences and play important syntactic roles such as subject and object. High-performance recognition of noun phrase chunks is therefore very important for syntactic analysis. This paper presents a Vietnamese noun phrase chunk recognition method based on the BiLSTM-CRF model and constraint rules. The method first carries out part-of-speech tagging and integrates the tagged part-of-speech features into the input vector of the model by concatenation. Second, the constraint rules are obtained by analyzing Vietnamese noun phrase chunks. Finally, the constraint rules are integrated into the output layer of the model to further optimize it. The experimental results show that the accuracy, recall and F-value of the method are 88.08%, 88.73% and 88.40% respectively.
|
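The three figures reported in the abstract above are consistent with the F-value being the usual harmonic mean of precision and recall. The check below assumes that definition and treats the reported accuracy as the precision P; neither assumption is confirmed by the abstract itself.

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 88.08 \times 88.73}{88.08 + 88.73}
  = \frac{15630.68}{176.81}
  \approx 88.40
```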
11 |
A Distributed Big Data Discretization Algorithm Under Spark |
Yeung Chan, Xia Jie Zhang, Jing Hua Zhu |
|
Abstract
Data discretization is an important step of data preprocessing in data mining; it can improve data quality and thus the accuracy and time performance of the subsequent learning process. In the era of big data, traditional discretization methods are no longer applicable and distributed discretization algorithms need to be designed. Hellinger-entropy, an important distance measure in information theory, is context-sensitive and feature-sensitive and therefore rich in useful information. In this paper we implement a Hellinger-entropy based distributed discretization algorithm under Apache Spark. We first measure the divergence of discrete intervals using Hellinger-entropy. Then we select the top-k boundary points according to the information provided by the divergence values of the discrete intervals. Finally, we divide the range of the continuous variable into k discrete intervals. We verify the distributed discretization performance as a preprocessing step for random forest, Bayes and multilayer perceptron classification on real sensor big data sets. Experimental results show that the time performance and classification accuracy of the distributed
|
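To make the idea of ranking boundary points by a Hellinger-based divergence concrete, here is a single-machine sketch; it does not use Spark and is not the paper's algorithm. It assumes a supervised setting with class labels, scores each candidate cut by the Hellinger distance between the class distributions on its left and right, and keeps the highest-scoring cuts. The scoring rule and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = p.keys() | q.keys()
    total = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                for k in keys)
    return math.sqrt(total / 2.0)


def class_distribution(labels):
    """Relative frequency of each class label."""
    return {c: n / len(labels) for c, n in Counter(labels).items()}


def top_cut_points(values, labels, num_cuts):
    """Return the num_cuts midpoints with the largest left/right divergence."""
    pairs = sorted(zip(values, labels))
    scored = []
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = class_distribution([label for _, label in pairs[:i]])
        right = class_distribution([label for _, label in pairs[i:]])
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        scored.append((hellinger(left, right), cut))
    scored.sort(key=lambda item: -item[0])
    return sorted(cut for _, cut in scored[:num_cuts])


# Assumed toy data: two well-separated groups of sensor readings.
readings = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]
cuts = top_cut_points(readings, classes, num_cuts=1)
```

Note that in this sketch m cut points split the range into m + 1 intervals; how the paper counts boundary points against its k intervals is not clear from the abstract.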
12 |
A Novel Distributed Duration-Aware LSTM for Large Scale Sequential Data Analysis |
Dejiao Niu, Yawen Liu, Tao Cai, Xia Zheng, Tianquan Liu, Shijie Zhou |
|
Abstract
Long short-term memory (LSTM) is an important model for sequential data processing. However, the large number of matrix computations in the LSTM unit seriously burdens training as models grow larger and deeper and as more data become available. In this work, we propose an efficient distributed duration-aware LSTM (D-LSTM) for large scale sequential data analysis. We improve LSTM’s training performance in two respects. First, the duration of each sequence item is exploited to design a computationally efficient cell, called the duration-aware LSTM (D-LSTM) unit. With an additional mask gate, the D-LSTM cell is able to perceive the duration of a sequence item and adapt its memory update accordingly. Second, on the basis of the D-LSTM unit, a novel distributed training algorithm is proposed, in which the D-LSTM network is divided logically and multiple distributed neurons are introduced to perform the simpler linear calculations in parallel. Unlike the physical division used in model parallelism, the logical split based on hidden neurons can greatly reduce the communication overhead, which is a major bottleneck in distributed training. We evaluate the effectiveness of t
|
13 |
Considering User Distribution and Cost Awareness to Optimize Server Deployment |
Yanling Shao, Wenyong Dong |
|
Abstract
In edge computing systems, it is crucial to select suitable placement sites and the right number of servers so as to achieve low latency for Internet of Things (IoT) applications and balanced server utilization. This paper therefore proposes a cost-aware method for optimizing edge server deployment. First, we model the edge server placement problem as a Mixed Integer Nonlinear Programming problem (MNIP), which comprehensively considers the resource allocation ratio, regional average load, and access delay. Then, the Benders decomposition algorithm is employed to solve it. The simulation results show that the proposed method finds better placements for the edge micro datacenter (MDC) than state-of-the-art server deployment strategies in terms of application latency and resource utilization.
|
14 |
Convolutional Neural Networks on EEG-Based Emotion Recognition |
Chunbin Li, Xiao Sun, Yindong Dong, Fuji Ren |
|
Abstract
Human Computer Interaction (HCI) enables people to transfer and exchange information with computers. To make interaction friendlier, integrating HCI with emotional factors has been intensively investigated. In this paper, an effective method is proposed to recognize human emotions from electroencephalogram (EEG) signals, which record the electrical activity of the brain. First, the EEG signals are converted to a multispectral image that preserves the local distance between any two nearby electrodes. Notably, our method preserves the features of EEG signals in the frequency and spatial dimensions, unlike standard EEG analysis techniques, which interpret electrode locations inaccurately. Then a Convolutional Neural Network (CNN) model is applied to identify human emotions from the image containing the EEG features, because CNNs are highly effective in image recognition. A publicly available dataset, the DEAP dataset, is used to validate our algorithm. The results show that the mean classification accuracy is 81.64% for valence (low and high) and 80.25% for arousal (low and high) across 32 subjects.
|
15 |
Distributed Subgraph Matching Privacy Preserving Method for Dynamic Social Network |
Xiao-Lin Zhang, Hao-chen Yuan, Zhuo-lin Li, Huan-xiang Zhang, Jian Li |
|
Abstract
Cloud platforms are increasingly used to store and process large-scale social network data; if the way a cloud platform is used is not handled carefully, privacy leakage becomes a serious problem. In this paper, we propose a distributed k-automorphism algorithm and a distributed subgraph matching method. The distributed k-automorphism algorithm efficiently protects the privacy of social networks on the cloud platform by adding noise edges to ensure k-automorphism, and the distributed subgraph matching method quickly obtains temporary subgraph matching results. After the temporary results are joined, correct results are obtained on the client by recovering and filtering them according to the symmetry of the k-automorphism graph and the k-automorphism functions. We also propose a modified method that uses an incremental approach to solve the problem of dynamic subgraph matching. Experiments show that these methods are effective on large-scale social network graphs and that they effectively address privacy leakage in subgraph matching.
|
17 |
Clustering-Anonymization-Based Differential Location Privacy Preserving Protocol in WSN |
Ren-ji Huang, Qing Ye, Mo-Ci Li |
|
Abstract
Wireless sensor networks (WSNs) play a vital role in the era of big data and intelligent living and transmit large volumes of data. Location information, a vital part of the transmitted data, is widely used for detection and routing in the network. With big data mining and analysis, the security of location and data privacy in WSNs faces great challenges. To address active attacks such as node capture against node location privacy in WSNs, existing location privacy preserving protocols are analyzed and a Differential Location Privacy protocol based on Clustering Anonymization is proposed. By clustering sensor nodes with a genetic clustering algorithm, each individual location is hidden in the statistical location information of its group. The Laplace mechanism is also added to the protocol to achieve differential location privacy. Node location privacy in the WSN is preserved while the privacy budget is saved. Theoretical analysis and comparative simulation experiments show that the protocol is useful.
|
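The combination described in the abstract above, reporting a group statistic instead of an individual location and perturbing it with the Laplace mechanism, can be sketched as follows. This is a simplified illustration, not the paper's protocol: the cluster assignment is taken as given (the genetic clustering step is not implemented), the centroid stands in for the group's statistical location, and `sensitivity` and `epsilon` are caller-supplied assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_centroids(locations, cluster_of, sensitivity, epsilon, seed=None):
    """Return one perturbed centroid per cluster.

    locations: dict of node id -> (x, y).
    cluster_of: dict of node id -> cluster id (assumed precomputed).
    The Laplace scale is sensitivity / epsilon, as in the standard mechanism.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    members = {}
    for node, point in locations.items():
        members.setdefault(cluster_of[node], []).append(point)
    result = {}
    for cluster_id, points in members.items():
        cx = sum(p[0] for p in points) / len(points)
        cy = sum(p[1] for p in points) / len(points)
        result[cluster_id] = (cx + laplace_noise(scale, rng),
                              cy + laplace_noise(scale, rng))
    return result


# Assumed toy network: four nodes in two clusters.
nodes = {"n1": (0.0, 0.0), "n2": (2.0, 0.0), "n3": (10.0, 10.0), "n4": (12.0, 10.0)}
clusters = {"n1": "c1", "n2": "c1", "n3": "c2", "n4": "c2"}
published = noisy_centroids(nodes, clusters, sensitivity=1.0, epsilon=0.5, seed=7)
```

A smaller `epsilon` gives a larger noise scale and stronger privacy; a real deployment would have to derive the sensitivity from the cluster sizes and the bounds of the deployment area rather than pass a constant.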
18 |
Distributed Graph Perturbation Algorithm on Social Networks with Reachability Preservation |
Xiaolin Zhang, Jian Li, Xiaoyu He, Jiao Liu |
|
Abstract
With the rapid development of social networks, the scale of graph data continues to increase, and the performance of social network anonymization methods is limited. Node reachability queries are essential in directed graphs, as they reflect the relationships between nodes and the direction of information dissemination. To address the problem of node reachability in privacy techniques for directed social networks, this paper proposes a reachability preserving distribution perturbation (RPDP) algorithm based on the distributed graph processing system GraphX. The algorithm first generates a Random Neighborhood Table (RNT) composed of four tuples for the nodes and then uses GraphX message passing and a “probe” mechanism. The proposed algorithm improves the processing efficiency for large-scale social networks while maintaining node reachability. Experiments on real social network data show that the proposed algorithm preserves node reachability and handles large-scale social networks efficiently while protecting the characteristics of the graph structure.
|
19 |
Minimum Spanning Tree Clustering Based on Density Filtering |
Ke Wang, Xia Xie, Jiayu Sun, Wenzhi Cao |
|
Abstract
Clustering analysis is an important method in data mining. To recognize clusters with arbitrary shapes as well as clusters with different densities, we propose a new clustering approach: minimum spanning tree clustering based on density filtering. It masks low-density points in the density filtering step, which reduces the interference of noise and makes the gaps between clusters clearer. It uses the relative values of adjacent distances to find abrupt changes in density and transitions between clusters, and divides the data set accordingly. Tests on multiple synthetic and real-world data sets show that the algorithm detects clusters of arbitrary shape and is insensitive to density imbalance between clusters.
|
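A minimal sketch of the two stages named in the abstract above, density filtering followed by cutting a minimum spanning tree, is given below. It is a simplification under stated assumptions, not the paper's algorithm: density is the count of neighbors within `radius`, and an MST edge is cut when it is longer than `ratio` times the median MST edge length, which replaces the paper's comparison of adjacent distances. All parameter names are hypothetical; `math.dist` needs Python 3.8 or later.

```python
import math


def mst_density_clusters(points, radius, min_neighbors, ratio):
    """Cluster points by cutting long MST edges after dropping sparse points."""
    def dist(i, j):
        return math.dist(points[i], points[j])

    n = len(points)
    # Density filtering: keep points with enough neighbors within radius.
    dense = [i for i in range(n)
             if sum(1 for j in range(n)
                    if j != i and dist(i, j) <= radius) >= min_neighbors]
    if not dense:
        return []
    # Prim's algorithm over the remaining points.
    best = {i: (dist(dense[0], i), dense[0]) for i in dense[1:]}
    edges = []
    while best:
        i = min(best, key=lambda k: best[k][0])
        d, parent_node = best.pop(i)
        edges.append((d, parent_node, i))
        for j in best:
            dj = dist(i, j)
            if dj < best[j][0]:
                best[j] = (dj, i)
    # Keep only edges that are short relative to the median edge length.
    lengths = sorted(d for d, _, _ in edges)
    median = lengths[len(lengths) // 2] if lengths else 0.0
    parent = {i: i for i in dense}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for d, a, b in edges:
        if d <= ratio * median:
            parent[find(a)] = find(b)
    groups = {}
    for i in dense:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Assumed toy data: two tight groups plus one isolated noise point (index 6).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
clusters = mst_density_clusters(data, radius=1.5, min_neighbors=2, ratio=3.0)
```

Removing the isolated point before building the tree is what keeps it from bridging the two groups, which is the effect the abstract attributes to density filtering.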
20 |
Research of CouchDB Storage Plugin for Big Data Query Engine Apache Drill |
Yulei Liao, Liang Tan |
|
Abstract
Currently, MongoDB is the only document-oriented database supported by Apache Drill. However, owing to MongoDB's shortcomings in data model, application interface, security and usability, Apache Drill is limited in querying and processing document data. CouchDB is an emerging document-oriented database. Compared with MongoDB, CouchDB has the advantages of supporting triggers, running in Android and BSD environments, rendering in JSON format, and supporting any language that can issue HTTP requests, but CouchDB has low query performance and does not support standard SQL queries. Research on a CouchDB storage plugin for Apache Drill is therefore worthwhile. This paper first studies the basic architecture of CouchDB and Apache Drill, the query flow of Apache Drill, and the ValueVector data structure, then designs and implements a CouchDB storage plugin based on Apache Drill’s storage plugin specification and CouchDB’s application programming interface. With a simple configuration, users can use CouchDB as a data source for the Apache Drill query engine. Experiments show that the CouchDB Storage Plugin not only further enhances Apache Drill’s query and management capabilities for docum
|