# Alumni Dissertations and Theses


### Clustering Categorical Data Using Data Summaries and Spectral Techniques

**Author:** Eman Abdu · **Year of Dissertation:** 2009 · **Program:** Computer Science · **Advisor:** Bilal Khan

**Abstract:** Cluster analysis is an active area of research with applications in various fields, including information retrieval, social sciences, bioinformatics, object recognition, and image segmentation (Jain et al., 1999). However, most algorithms are intended for numerical (continuous) data, where proximity among data objects is naturally defined by virtue of their numerical properties. Although these algorithms can be used on categorical data, they are not designed to handle data properties typically found in this data type, such as high dimensionality and the lack of inherent relationships among attribute values. During the past decade, several algorithms have been designed for categorical data, such as K-modes (Huang, 1998), STIRR (Gibson et al., 1998), CACTUS (Ganti et al., 1999), ROCK (Guha et al., 1999), COOLCAT (Barbara et al., 2002), LIMBO (Andritsos et al., 2004), CLICKS (Zaki et al., 2007), and others. Some of these algorithms exploit attribute relationships through data summaries, such as attribute occurrence and co-occurrence frequencies, while others use information entropy and links among data objects. In this thesis, we focus on using data summaries and spectral analysis to detect clustering structure in categorical data. Spectral techniques provide a relaxed solution to the discrete clustering problem, which has been shown to be NP-hard (Drineas et al., 2004). Formulating the clustering problem as a graph partitioning problem and then finding the minimum normalized cut leads to a solution based on eigenvectors of the similarity matrix (i.e., the Laplacian matrix). Spectral methods have been used in various algorithms and have been shown to find non-linearly separable clusters. Equally important, spectral analysis encompasses techniques for handling high-dimensional data, since input data is projected into a lower-dimensional space where all computations and comparisons can be performed.

Our approach is to extend spectral techniques to data summaries, which are relatively less expensive to compute than a data-object similarity matrix for very large data sets. Our goal is to combine the benefits of spectral analysis with the relatively low cost of computing data summaries. We have developed three algorithms for clustering categorical data using data summaries, two of which use spectral techniques. Our test results on standard and synthetic data sets show that our algorithms are competitive with current spectral and non-spectral algorithms for categorical data. Our algorithms provide a solution to the categorical data clustering problem that produces quality clusterings and is scalable to large data sets.
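The normalized-cut relaxation mentioned in this abstract can be sketched in a few lines: build the normalized Laplacian of a similarity matrix and split on the sign of its second-smallest eigenvector (the Fiedler vector). This is a minimal generic illustration, not the thesis's summary-based algorithms; the toy similarity matrix is invented for the example.

```python
import numpy as np

def spectral_bipartition(W):
    """Split objects into two clusters given a symmetric similarity matrix W.

    Uses the sign of the Fiedler vector (second-smallest eigenvector of the
    normalized Laplacian), the standard relaxation of the normalized cut.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)    # eigenpairs in ascending eigenvalue order
    fiedler = vecs[:, 1]              # second-smallest eigenvector
    return (fiedler > 0).astype(int)  # its sign pattern gives the 2-way cut

# Toy similarity matrix: tight groups {0,1,2} and {3,4} joined by one weak link
W = np.array([
    [0, 5, 5, 0, 1],
    [5, 0, 5, 0, 0],
    [5, 5, 0, 0, 0],
    [0, 0, 0, 0, 5],
    [1, 0, 0, 5, 0],
], dtype=float)
labels = spectral_bipartition(W)
```

The absolute label values (0 vs. 1) are arbitrary; only the grouping is meaningful.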

### Performance Effects of Computer-Based Multitasking Behavior

**Author:** Rachel Adler · **Year of Dissertation:** 2012 · **Program:** Computer Science · **Advisor:** Raquel Benbunan-Fich

**Abstract:** This research examines multitasking from the perspective of human-computer interaction (HCI). Multitasking is defined as the performance of multiple tasks concurrently. In a computer-based environment, users generally switch between multiple computer-based tasks either due to a personal decision to break from the current task or due to an external interruption, such as an electronic notification. This dissertation describes an in-depth empirical study, using a laboratory setting with different numeric, verbal, and visual computer-based tasks. Six hundred and thirty-six subjects were randomly assigned to three conditions: discretionary multitasking, where participants were allowed to decide when and how often to switch tasks; forced multitasking, where participants were forced to switch tasks at certain allotted times; and non-multitasking, where participants performed the tasks sequentially and were not allowed to multitask. In order to investigate performance effectiveness (accuracy) and performance efficiency (productivity), participants' overall accuracy and productivity scores were compared across conditions. The results suggest that during difficult tasks, subjects who were forced to multitask had the lowest accuracy. In addition, those subjects in the forced multitasking condition who felt the primary task was difficult had lower accuracy than those who felt the task was easy; this was not true in the other two conditions. Receiving interruptions during a difficult task impacted not only subjects' primary task, but their secondary tasks as well. In the discretionary multitasking condition, the more subjects decided to multitask, the lower their accuracy scores.

In fact, an additional analysis revealed that high multitaskers not only performed worse than low and medium multitaskers in the discretionary condition, but actually performed worse than subjects in any other condition. Medium multitaskers, however, had the highest productivity scores. While that level of multitasking was the best in terms of efficiency, it was not in terms of effectiveness. Therefore, discretionary multitasking gives the illusion of high performance. Furthermore, this study also explored why people chose to multitask and the impact this had on performance. The results of this study can assist HCI researchers in developing a more comprehensive understanding of user multitasking, which can lead to better interface designs.

### SCHEDULING AND RESOURCE ALLOCATION IN WIRELESS SENSOR NETWORKS

**Author:** Yosef Alayev · **Year of Dissertation:** 2014 · **Program:** Computer Science · **Advisor:** Amotz Bar-Noy

**Abstract:** In computer science and telecommunications, wireless sensor networks are an active research area. Each sensor in a wireless sensor network has some pre-defined or on-demand tasks, such as collecting or disseminating data. Network resources, such as broadcast channels, number of sensors, power, and battery life, are limited. Hence, a schedule is required to allocate network resources optimally, so as to maximize some profit or minimize some cost. This thesis focuses on scheduling problems in the wireless sensor network environment. In particular, we study three scheduling problems in wireless sensor networks: broadcast scheduling, sensor scheduling for area monitoring, and content distribution scheduling. For each problem, the goal is to find efficient scheduling algorithms that have good approximation guarantees and perform well in practice.

### Some Non-Classical Methods in Epistemic Logic and Games

**Author:** Can Baskent · **Year of Dissertation:** 2012 · **Program:** Computer Science · **Advisor:** Rohit Parikh

**Abstract:** In this dissertation, we consider some non-classical methods in epistemic logic and games. We first consider dynamic epistemic logics in topological and geometric semantics, and then extend such ideas to cases where inconsistencies are allowed. Then, as a case example, we discuss a well-known paradox in game theory that is essentially a two-person Russell's paradox. Finally, we conclude by considering an alternative approach to games in which strategies are the primitives of the theory, and by advancing some results.

### ENHANCING THE PERFORMANCE OF ACTIVE CONNECTIONS IN MANETS THROUGH DYNAMIC ROUTE AND POWER OPTIMIZATION

**Author:** Zeki Bilgin · **Year of Dissertation:** 2010 · **Program:** Computer Science · **Advisor:** Bilal Khan

**Abstract:** In this thesis, we consider two significant problems that occur within active connections in mobile ad hoc networks (MANETs): (A) degradation of path optimality in terms of hop count, and (B) failures of the constituent links of a path. Both phenomena occur over time because of node movement. Our investigation considers what can be done to minimize the occurrence of both, after the problem of initial route selection has been resolved by standard MANET routing protocols. In developing solutions to the aforementioned problems, we identified two broad and complementary approaches: (i) variable topology, fixed power: these approaches assume that the transmission power of the nodes is kept fixed, but the topology of the connections is modifiable during their lifetimes; (ii) variable power, fixed topology: these approaches assume that the topological structure of the connection must be kept fixed, but the transmission power levels used by constituent nodes are adjustable. Within approach (i), we developed (A) two new route optimization schemes that seek to shorten path lengths by eliminating inessential hops "on the fly", without relying on the promiscuous mode of wireless cards, and (B) two new route maintenance schemes that circumvent impending link failures and heal broken links in an efficient way. We implemented our schemes in the ns2 packet-level network simulator, as extensions to the Ad hoc On Demand Distance Vector (AODV) routing protocol. Through extensive simulations, we show that our schemes are able to optimize path lengths, increase connection lifetime, reduce overall control traffic overhead, decrease end-to-end delay, and provide energy savings in packet transmissions. Within approach (ii), we developed several new dynamic power-budget distribution schemes.

These were evaluated using a new model in which each connection is assigned a fixed power budget and seeks to distribute this budget among its constituent nodes so as to increase the connection's lifetime. We implemented our schemes as a discrete event simulation. Through extensive simulation experiments, we showed that our schemes are able to consistently improve connection lifetimes without excessive additional control traffic overhead. The conclusions of both studies are seen to hold scalably as one varies situational parameters such as network size, number of connections, and node mobility levels.

### PRIVACY-PRESERVING QUERY PROCESSING ON TEXT DOCUMENTS

**Author:** Sahin Buyrukbilen · **Year of Dissertation:** 2013 · **Program:** Computer Science · **Advisor:** Spiridon Bakiras

**Abstract:** Privacy-preserving query processing is an essential component of data processing, especially in outsourced databases or in data operations that have special security and privacy requirements, such as the sharing of sensitive data. While cloud computing and data outsourcing attract an increasing number of customers, the security and privacy of sensitive data remain an open problem. Encryption secures the data against unauthorized access, but it does not provide the ability to query the data unless the encryption scheme is searchable. Searchable encryption can be either private-key or public-key, depending on the needs of the user. In general, private-key solutions are faster but suffer from a key-management problem. On the other hand, public-key solutions provide more flexibility, but their running times are much higher than those of private-key protocols. Furthermore, parties may sometimes be forced to share data in order to comply with regulations or agreements. For example, different health care companies or intelligence agencies may need to find out whether they have similar records in their databases without compromising privacy. Consequently, privacy-preserving similarity search between text documents is an emerging field as sensitive data sharing becomes inevitable. In this dissertation we present two privacy-preserving text processing protocols: (i) a ranked keyword search mechanism over outsourced public-key encrypted data, and (ii) a similar document detection system. We introduce efficient algorithms for answering these query types and illustrate their feasibility in real-life applications.

### PASSIVE INDOOR LEVELED RFID LOCALIZATION ALGORITHMS

**Author:** Matthew Chan · **Year of Dissertation:** 2013 · **Program:** Computer Science · **Advisor:** Xiaowen Zhang

**Abstract:** One of the most sought-after innovations in RFID technology is the ability to accurately locate stationary objects and track moving entities in real time. The author proposes three multi-leveled, detectable-count RFID localization algorithms (nearest-neighbor, multilateration, and Bayesian inference) to accomplish these tasks using UHF passive RFID tags, chosen for their low cost and efficient implementation, by affixing them onto the floor as known reference nodes. Simulations are conducted to examine the accuracy and performance of the algorithms in locating stationary and mobile objects. Furthermore, experiments are carried out to test the localization of stationary objects in a real-world setting, such as a laboratory environment. The outcomes from the simulations and experiments are analyzed. Most importantly, when proper parameter values are chosen, such as reference-tag density and detection range, the algorithms achieve impressive accuracy, which confirms that the proposed methods are highly preferable when accurate, efficient, and cost-effective passive RFID localization systems are to be implemented. Future directions of the study include exploration of different ratios for the three power levels of the RFID reader; use of reference-tag spacing patterns other than square, such as hexagonal; examination of multi-level approaches besides the tri-level one, such as dual-, quad-, or penta-level; experimentation with different kinds of RFID reference tags besides the passive Alien type G; and field tests of the methods for mobile entities in realistic real-world settings such as a laboratory.
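The nearest-neighbor idea behind reference-tag localization can be sketched as follows: reference tags sit at known floor positions, and an object's position is estimated from the tags its reader detects. The grid layout, tag names, and centroid rule below are illustrative assumptions, not the thesis's exact algorithms.

```python
# Hypothetical nearest-neighbor localization over a square grid of
# floor-mounted reference tags (names and spacing are invented).
def locate(detected, positions):
    """Estimate an object's position as the centroid of the known
    positions of the reference tags its reader detected."""
    pts = [positions[t] for t in detected]
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

# 3x3 grid of reference tags with 1 m spacing affixed to the floor
positions = {f"tag{i}{j}": (float(i), float(j))
             for i in range(3) for j in range(3)}

# A reader near (0.5, 0.5) detects the four surrounding reference tags
estimate = locate(["tag00", "tag01", "tag10", "tag11"], positions)
```

Denser reference grids and read-count weighting (as in the detectable-count variants) refine this basic estimate.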

### Conflict-free coloring

**Author:** Panagiotis Cheilaris · **Year of Dissertation:** 2009 · **Program:** Computer Science · **Advisor:** Stathis Zachos

**Abstract:** Graph and hypergraph colorings constitute an important subject in combinatorics and algorithm theory. In this work, we study conflict-free coloring for hypergraphs. Conflict-free coloring is one possible generalization of traditional graph coloring. Conflict-free coloring of hypergraphs induced by geometric shapes, like intervals on the line or disks in the plane, has applications in frequency assignment in cellular networks. Colors model frequencies, and since the frequency spectrum is limited and expensive, the goal of an algorithm is to minimize the number of assigned frequencies, that is, to reuse frequencies as much as possible. We concentrate on an online variation of the problem, especially in the case where the hypergraph is induced by intervals. For deterministic algorithms, we introduce a hierarchy of models ranging from static to online, and we compute lower and upper bounds on the number of colors used. In the randomized oblivious adversary model, we introduce a framework for conflict-free coloring a specific class of hypergraphs with a logarithmic number of colors. This class includes many hypergraphs arising in geometry, and the framework yields online randomized algorithms that use fewer colors and fewer random bits than other algorithms in the literature. Based on the same framework, we initiate the study of online deterministic algorithms that recolor few points. For the problem of conflict-free coloring points with respect to a given set of intervals, we describe an efficient algorithm that computes a coloring with at most twice the number of colors of an optimal coloring. We also show that there is a family of inputs that forces our algorithm to use twice the number of colors of an optimal solution. Then, we study conflict-free coloring problems in graphs.

We compare conflict-free coloring with respect to paths of graphs to a closely related problem called vertex ranking, or ordered coloring. For conflict-free coloring with respect to neighborhoods of vertices of graphs, we prove that a number of colors on the order of the square root of the number of vertices is sufficient and sometimes necessary. Finally, we initiate the study of Ramsey-type problems for conflict-free colorings and compute a van der Waerden-like number.
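For intuition, the classical divide-and-conquer coloring of points on a line is conflict-free with respect to every interval and uses O(log n) colors; this is the standard static construction, not the thesis's online models or its 2-approximation algorithm.

```python
def cf_color_points(n):
    """Conflict-free coloring of n points on a line w.r.t. all intervals:
    color the median with the current color, then recurse on both halves
    with the next color. Any interval's minimum-color point is unique,
    so O(log n) colors suffice."""
    colors = [0] * n

    def rec(lo, hi, c):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        colors[mid] = c
        rec(lo, mid - 1, c + 1)
        rec(mid + 1, hi, c + 1)

    rec(0, n - 1, 0)
    return colors

def is_conflict_free(colors):
    """Check that every contiguous interval has a color appearing exactly once."""
    n = len(colors)
    for i in range(n):
        for j in range(i, n):
            counts = {}
            for c in colors[i:j + 1]:
                counts[c] = counts.get(c, 0) + 1
            if not any(v == 1 for v in counts.values()):
                return False
    return True

colors = cf_color_points(10)
```

The uniqueness argument: two points with the same color lie in different recursion subtrees, and their common ancestor, which has a smaller color, lies between them, hence inside any interval containing both.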

### Efficient controls for finitely convergent sequential algorithms and their applications

**Author:** Wei Chen · **Year of Dissertation:** 2010 · **Program:** Computer Science · **Advisor:** Gabor Herman

**Abstract:** Finding a feasible point that satisfies a set of constraints, or a point that optimizes a function subject to a set of constraints, is a common task in scientific computing: examples are the linear/convex feasibility/optimization problems. Projection methods have been used widely to solve many such problems of large size much more efficiently than other alternatives. Finitely convergent sequential algorithms are projection methods that iterate sequentially on individual constraints, one at a time, but overall find a feasible point in a finite number of iterations. ART3 is an example using cyclic control, in the sense that it repeatedly cycles through the given constraints. Skipping unnecessary checks on constraints that are likely to be satisfied leads to the new algorithm ART3+, a variant of ART3 whose control is no longer cyclic, but which is still finitely convergent. Experiments in fitting pixel images by blob images show that ART3+ is statistically significantly faster than ART3. Furthermore, a general methodology is proposed for the automatic transformation of any finitely convergent sequential algorithm in such a way that (1) finite convergence is retained and (2) the speed of finite convergence is improved. The first property is proved by mathematical theorems; the second is illustrated by applying the algorithms to practical problems. This transformation is applicable, for example, to the finitely convergent modified cyclic subgradient projection algorithm for solving convex feasibility problems. One application is intensity modulated radiation therapy (IMRT), whose goal is to deliver sufficient doses to tumors to kill them without causing irreparable damage to critical organs. The superior performance of ART3+ for IMRT is demonstrated.

An optimization algorithm based on ART3+ is proposed to handle linear constraints with either linear objectives or certain convex objectives. It is on average more than two orders of magnitude faster than state-of-the-art industry-standard commercial algorithms (MOSEK's interior point optimizer and primal/dual simplex optimizer) and has almost no memory overhead. This allows for the fast creation of multi-criteria treatment plan databases that span the global planning options available to the treatment planner.
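The flavor of such sequential projection methods can be illustrated with a basic cyclic projection scheme for linear feasibility, in the spirit of ART-type algorithms but not the thesis's ART3/ART3+ with their specialized controls; the constraints are invented for the example.

```python
# A minimal sketch: repeatedly sweep through half-space constraints
# a . x <= b, orthogonally projecting the current point onto each
# violated constraint's boundary, until a full sweep finds no violation.
def cyclic_projection(constraints, x, sweeps=100):
    for _ in range(sweeps):
        feasible = True
        for a, b in constraints:            # one sequential pass ("cycle")
            ax = sum(ai * xi for ai, xi in zip(a, x))
            if ax > b + 1e-12:              # constraint violated
                feasible = False
                norm2 = sum(ai * ai for ai in a)
                step = (ax - b) / norm2     # orthogonal projection onto a . x = b
                x = [xi - step * ai for ai, xi in zip(a, x)]
        if feasible:                        # a full pass with no violations
            return x
    return x

# Feasible region: x >= 1, y >= 1, x + y <= 4, written as a . x <= b
cons = [((-1.0, 0.0), -1.0), ((0.0, -1.0), -1.0), ((1.0, 1.0), 4.0)]
sol = cyclic_projection(cons, [5.0, 5.0])
```

ART3-style methods refine this basic loop (e.g., with reflections and, in ART3+, non-cyclic control that skips constraints likely to be satisfied) to guarantee finite convergence.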

### COLLABORATIVE RANKING AND COLLABORATIVE CLUSTERING

**Author:** Zheng Chen · **Year of Dissertation:** 2013 · **Program:** Computer Science · **Advisor:** Heng Ji

**Abstract:** Ranking and clustering are two important problems in machine learning and have wide applications in Natural Language Processing (NLP). A ranking problem is typically formulated as ranking a collection of candidate "objects" with respect to a "query", while a clustering problem is formulated as organizing a set of instances into groups such that members within each group share some similarity while members across groups are dissimilar. In this thesis, we introduce collaborative schemes into ranking and clustering problems, and name them "collaborative ranking" and "collaborative clustering" respectively. In contrast to traditional non-collaborative schemes, collaborative ranking leverages strengths from multiple query collaborators and ranker collaborators, while collaborative clustering leverages strengths from multiple instance collaborators and clusterer collaborators. We select several typical NLP problems as our case studies, including entity linking, document clustering, and named entity clustering.