Big Data & Information Analytics

July 2016, Volume 1, Issue 2&3

A review on low-rank models in data analysis
Zhouchen Lin
2016, 1(2&3): 139-161 doi: 10.3934/bdia.2016001
Nowadays we are in the big data era. The high dimensionality of data poses a big challenge to processing them effectively and efficiently. Fortunately, in practice data are not unstructured. Their samples usually lie around low-dimensional manifolds and are highly correlated. Such characteristics can be effectively depicted by low rankness. As an extension of the sparsity of first-order data, such as voices, low rankness is also an effective measure of the sparsity of second-order data, such as images. In this paper, I review the representative theories, algorithms and applications of low-rank subspace recovery models in data processing.
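As a small illustration of the low-rank idea above (a sketch, not taken from the paper): when every sample is a linear combination of a few basis vectors, the data matrix has a rank far below its dimensions. The helper `matrix_rank` and the toy data are hypothetical, and exact `Fraction` arithmetic is used only to avoid choosing a numerical tolerance.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination in exact arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Two hypothetical basis vectors in a 5-dimensional space (illustrative only).
b1 = [1, 0, 2, -1, 3]
b2 = [0, 1, 1, 4, -2]

# Six samples, each a linear combination of b1 and b2, so rows are highly correlated.
coeffs = [(1, 0), (0, 1), (2, 3), (-1, 5), (4, -2), (3, 3)]
data = [[a * x + b * y for x, y in zip(b1, b2)] for a, b in coeffs]

low_rank = matrix_rank(data)  # rank of the 6 x 5 data matrix
print(low_rank)
```

Although the matrix is 6 x 5, its rank equals the number of underlying basis vectors. Real data are noisy, so practical low-rank models work with approximate rather than exact rank.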
Born to be big: Data, graphs, and their entangled complexity
Enrico Capobianco
2016, 1(2&3): 163-169 doi: 10.3934/bdia.2016002
Big Data and Big Graphs have become landmarks of current cross-border research, destined to remain so for a long time. While we try to optimize our ability to assimilate both, novel methods continue to inspire new applications, and vice versa. Clearly these two big things, data and graphs, are connected, but can we ensure management of their complexities, computational efficiency, and robust inference? Critical bridging features are addressed here to identify grand challenges and bottlenecks.
Time series based urban air quality predication
Ruiqi Li, Yifan Chen, Xiang Zhao, Yanli Hu and Weidong Xiao
2016, 1(2&3): 171-183 doi: 10.3934/bdia.2016003
Urban air pollution poses a great threat to human health and has been a major concern of many metropolises in developing countries. Lately, a few air quality monitoring stations have been established in countries suffering from air pollution to inform the public of real-time air quality indices based on fine particulate matter, e.g. $PM_{2.5}$. Air quality, unfortunately, is fairly difficult to manage due to multiple complex human activities, from driving to smelting. We observe that the hidden regular patterns of human activities make prediction possible, and this motivates us to infer urban air conditions from the perspective of time series. In this paper, we focus on $PM_{2.5}$-based urban air quality and introduce two kinds of time-series methods for real-time and fine-grained air quality prediction, harnessing historical air quality data reported by existing monitoring stations. The methods are evaluated on real-life $PM_{2.5}$ concentration data for the year 2013 (January - December) in Wuhan, China.
How do I choose the right NoSQL solution? A comprehensive theoretical and experimental survey
Hamzeh Khazaei, Marios Fokaefs, Saeed Zareian, Nasim Beigi-Mohammadi, Brian Ramprasad, Mark Shtern, Purwa Gaikwad and Marin Litoiu
2016, 1(2&3): 185-216 doi: 10.3934/bdia.2016004
With the advent of the Internet of Things (IoT) and cloud computing, the need for data stores that would be able to store and process big data in an efficient and cost-effective manner has increased dramatically. Traditional data stores seem to have numerous limitations in addressing such requirements. NoSQL data stores have been designed and implemented to address the shortcomings of relational databases by compromising on ACID and transactional properties to achieve high scalability and availability. These systems are designed to scale to thousands or millions of users performing updates, as well as reads, in contrast to traditional RDBMSs and data warehouses. Although there is a plethora of potential NoSQL implementations, there is no one-size-fits-all solution to satisfy even main requirements. In this paper, we explore popular and commonly used NoSQL technologies and elaborate on their documentation, existing literature and performance evaluation. More specifically, we will describe the background, characteristics, classification, data model and evaluation of NoSQL solutions that aim to provide the capabilities for big data analytics. This work is intended to help users, individuals or organizations, to obtain a clear view of the strengths and weaknesses of well-known NoSQL data stores and select the right technology for their applications and use cases. To do so, we first present a systematic approach to narrow down the proper NoSQL candidates and then adopt an experimental methodology that can be repeated by anyone to find the best among short listed candidates considering their specific requirements.
Forward supervised discretization for multivariate with categorical responses
Wenxue Huang and Qitian Qiu
2016, 1(2&3): 217-225 doi: 10.3934/bdia.2016005
Given a data set with one categorical response variable and multiple categorical or continuous explanatory variables, it is required in some applications to discretize the continuous explanatory ones. A proper supervised discretization usually achieves a better result than an unsupervised one. Rather than discretizing each variable individually, as recently proposed by Huang, Pan and Wu in [12,13], we suggest a forward supervised discretization algorithm to capture a higher association from the multiple explanatory variables to the response variable. Experiments with the GK-tau and the GK-lambda are presented to support the statement.
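For readers unfamiliar with the two association measures named above, the following sketch computes the Goodman-Kruskal lambda and tau from a contingency table (rows: explanatory categories, columns: response categories) and uses tau to compare two candidate cut points for one continuous variable. It is not the authors' forward algorithm; the function names, the helper `contingency`, and the toy data are hypothetical.

```python
def gk_lambda(table):
    """Goodman-Kruskal lambda: proportional reduction in prediction errors for
    the column variable (response) once the row variable is known."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    best_without_x = max(col_totals)              # always guess the modal response
    best_with_x = sum(max(row) for row in table)  # guess the modal response per row
    if n == best_without_x:
        return 0.0  # response is constant, nothing to explain
    return (best_with_x - best_without_x) / (n - best_without_x)


def gk_tau(table):
    """Goodman-Kruskal tau: proportional reduction in the Gini variation of the
    response once the row variable is known."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    row_totals = [sum(row) for row in table]
    v_y = 1.0 - sum((c / n) ** 2 for c in col_totals)  # marginal variation
    if v_y == 0:
        return 0.0
    # Expected conditional variation of the response given the row category.
    e_v = 1.0 - sum(
        x * x / (r * n)
        for row, r in zip(table, row_totals) if r > 0
        for x in row
    )
    return (v_y - e_v) / v_y


def contingency(x, y, cut):
    """2 x K table for the binary discretization x <= cut versus x > cut."""
    labels = sorted(set(y))
    table = [[0] * len(labels) for _ in range(2)]
    for xi, yi in zip(x, y):
        table[0 if xi <= cut else 1][labels.index(yi)] += 1
    return table


# Hypothetical toy data: the response switches from 'a' to 'b' between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

tau_good = gk_tau(contingency(x, y, 4.5))  # cut aligned with the response
tau_poor = gk_tau(contingency(x, y, 2.5))  # cut misaligned with the response
print(tau_good, tau_poor)
```

A supervised discretization in this spirit would prefer the cut with the larger tau, since it preserves more of the association between the explanatory variable and the response.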
Detecting coalition attacks in online advertising: A hybrid data mining approach
Qinglei Zhang and Wenying Feng
2016, 1(2&3): 227-245 doi: 10.3934/bdia.2016006
Coalition attacks are nowadays one of the most common types of attack in the online advertising industry. In this paper, we attempt to mitigate the problem of fraud by proposing a hybrid framework that detects coalition attacks based on multiple metrics. We also articulate the theoretical basis for integrating these metrics into the hybrid framework. Furthermore, we instantiate the framework with two metrics and develop a detection system that identifies coalition attacks from two distinct perspectives.
Scalable clustering by truncated fuzzy $c$-means
Guojun Gan, Qiujun Lan and Shiyang Sima
2016, 1(2&3): 247-259 doi: 10.3934/bdia.2016007
Most existing clustering algorithms are slow at dividing a large dataset into a large number of clusters. In this paper, we propose a truncated FCM algorithm to address this problem. The main idea behind our proposed algorithm is to keep only a small number of cluster centers during the iterative process of the FCM algorithm. Our numerical experiments on both synthetic and real datasets show that the proposed algorithm is much faster than the original FCM algorithm, with accuracy comparable to that of the original.
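The truncation idea can be sketched as follows, under one plausible reading of the abstract: in each iteration every point keeps fuzzy memberships only for its `t` nearest centers, so the per-point update involves `t` centers instead of all `k`. This is an assumption for illustration, not the authors' exact algorithm; the function `truncated_fcm`, its parameters, and the toy data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def truncated_fcm(points, init_centers, t, m=2.0, iters=30):
    """Fuzzy c-means in which each point keeps memberships only for its
    t nearest centers (assumed truncation rule, see the text above)."""
    centers = [list(c) for c in init_centers]
    k, dim = len(centers), len(points[0])
    eps = 1e-12  # avoids division by zero when a point coincides with a center
    for _ in range(iters):
        num = [[0.0] * dim for _ in range(k)]
        den = [0.0] * k
        for p in points:
            d = [sq_dist(p, c) + eps for c in centers]
            nearest = sorted(range(k), key=d.__getitem__)[:t]
            # Standard FCM membership formula, restricted to the t nearest centers.
            w = [d[j] ** (-1.0 / (m - 1.0)) for j in nearest]
            total = sum(w)
            for j, wj in zip(nearest, w):
                u = (wj / total) ** m  # membership raised to the fuzzifier m
                den[j] += u
                for a in range(dim):
                    num[j][a] += u * p[a]
        # Update each center as the weighted mean of the points that kept it.
        for j in range(k):
            if den[j] > 0:
                centers[j] = [v / den[j] for v in num[j]]
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    return centers, labels


# Hypothetical toy data: three tight groups of five points each.
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in true_centers for dx, dy in offsets]

# Deliberately perturbed starting centers; t=2 keeps 2 of the 3 memberships.
centers, labels = truncated_fcm(points, [(1, 1), (8, 1), (1, 8)], t=2)
print(centers, labels)
```

For simplicity this sketch still computes and sorts all `k` distances for every point, so it only illustrates the restricted membership update; a scalable implementation would also need a cheaper way to find the nearest centers.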
Time aware topic based recommender system
Elnaz Delpisheh, Aijun An, Heidar Davoudi and Emad Gohari Boroujerdi
2016, 1(2&3): 261-274 doi: 10.3934/bdia.2016008
News recommender systems efficiently handle the overwhelming number of news articles, simplify navigation, and retrieve relevant information. Many conventional news recommender systems use collaborative filtering to make recommendations based on the behavior of users in the system. In this approach, the introduction of new users or new items can cause the cold start problem, as there will be insufficient data on these new entries for the collaborative filtering to draw any inferences for new users or items. Content-based news recommender systems emerged to address the cold start problem. However, many content-based news recommender systems treat documents as a bag-of-words, neglecting the hidden themes of the news articles. In this paper, we propose a news recommender system leveraging topic models and the time spent on each article. We build an automated recommender system that is able to filter news articles and make recommendations based on users' preferences. We use topic models to identify the thematic structure of the corpus. These themes are incorporated into a content-based recommender system to filter out news articles that contain themes that are of less interest to users and to recommend articles that are thematically similar to users' preferences. Our experimental studies show that utilizing topic modeling and the time spent on a single article can outperform state-of-the-art recommendation techniques. The resulting recommender system based on the proposed method is currently operational at The Globe and Mail (
Manifold data mining helps businesses grow more effectively
Zhen Mei
2016, 1(2&3): 275-276 doi: 10.3934/bdia.2016009
This note introduces the research and development capacity of a data mining leader in Canada, Manifold Data Mining Inc. (Manifold), and its collaboration with the academic community.